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dimensional  point*,  where  local  i*  based  oa  the  projections  of  the  points  onto  the  curve  or  surface  of 
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A  number  of  linear  techniques,  such  as  factor  analysis  and  errors  in  variables  regression,  end 
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the  expected  squared  distance  between  the  points  and  the  curve.  Linear  principal  components  have 
this  property  as  well;  in  fact,  we  prove  that  if  a  principal  curve  is  straight,  then  it  is  a  principal 
component.  These  results  generalise  the  usual  duality  between  conditional  expectation  and  distance 


also  examine  two  sources  of  bias  in  the  procedures,  which  have  the  satisfactory 
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We  compare  the  principal  curve  and  surface  procedures  to  other  generalisations  of  principal 
components  in  the  literature;  the  usual  generalisations  transform  the  space,  whereas  we  transform 
the  model.  There  are  also  strong  ties  with  multidimensional  scaling. 
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Abstract 


Principal  curve*  are  sr  ooth  oa*  dimensional  curve*  that  pass  through  the  middle  of  a  p  dimensional 
data  set.  They  minimise  the  distance  from  the  points,  and  provide  a  non-linear  summary  of  the 
data.  The  curves  are  non- parametric  and  their  shape  is  suggested  by  the  data.  Similarly,  principal 
surfaces  are  two  dimensional  surfaces  that  pass  through  the  middle  of  the  data.  The  curves  and 
surfaces  are  found  using  an  iterative  procedure  which  starts  with  a  linear  summery  such  as  the  usual 
principal  component  fine  or  plane.  Each  successive  iteration  is  a  tmootk  or  local  average  of  the  p 
dimensional  points,  where  local  is  based  on  the  projections  of  the  points  onto  the  curve  or  surface  of 
the  previous  iteration. 

A  number  of  linear  techniques,  such  as  factor  analysis  and  errors  in  variables  regression,  end 
up  using  the  principal  components  ss  their  estimates  (after  a  suitable  scaling  of  the  co-ordinates). 
Principal  curves  and  surfaces  can  be  viewed  as  the  estimates  of  non- linear  generalisations  of  these 
procedures.  We  present  some  real  data  examples  that  illustrate  these  applications. 


Principal  Curves  (or  surfaces)  have  a  theoretical  definition  for  distributions:  they  are  the  Self 
Consistent  curves.  A  curve  is  self  consistent  if  each  point  on  the  curve  is  the  conditional  mean  of 
the  points  that  project  there.  The  main  theorem  proves  that  principal  curves  are  critical  values  of 
ths  expected  squared  distance  between  the  points  and  the  curve.  Linear  principal  components  have 
this  property  as  well;  in  fact,  we  prove  that  if  a  principal  curve  is  straight,  then  it  is  a  principal 
component.  These  results  generalise  the  usual  duality  between  conditional  expectation  and  distance 
minimisation.  We  also  examine  two  sources  of  bias  in  the  procedures,  which  have  the  satisfactory 
property  of  partially  cancelling  each  other. 

We  compare  the  principal  curve  and  surface  procedures  to  other1  generalisations  of  principal 
components  in  the  literature;  the  usual  generalisations  transform  the  space,  whereas  we  transform 
the  model.  There  are  also  strong  ties,  with  multidimensional  scaling. 
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Chapter  1 


Introduction 

Consider  a  data  set  consisting  of  n  observations  on  two  variables,  x  and  y.  We  can 
represent  the  n  points  is  a  scatterplot,  as  is.  figure  1.1.  It  is  natural  to  try  and  summarize 
the  joint  behaviour  exhibited  by  the  points  in  the  scatterplot.  The  form  of  summary  we 
chose  depends  on  the  goal  of  our  analysis.  A  trivial  summary  is  the  mean  vector  which 
simply  locates  the  center  of  the  cloud  but  conveys  no  in  >rmation  about  the  joint  behaviour 
of  the  two  variables. 


Figure  1.1  A  bivariate  data  set  represented  by  a  scatterplot. 

It  is  often  sensible  to  treat  one  of  the  variables  as  a  response  variable,  and  the  other 
as  an  explanatory  variable.  The  aim  of  the  analysis  is  then  to  seek  a  rule  for  predicting  the 
response  (or  average  response)  using  the  value  of  the  explanatory  variable.  Standard  linear 
regression  produces  a  linear  prediction  rule.  The  expectation  of  y  is  modeled  as  a  linear 
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function  of  x  and  is  estimated  by  least  squares.  This  procedure  is  equivalent  to  finding  the 
line  that  minimizes  the  sum  of  vertical  squrv  d  errors,  as  depicted  in  figure  1.2a. 

When  looking  at  such  a  regression  Vue,  it  is  natural  to  think  cl  it  as  a  summary  of  the 
data.  However,  m  constructing  this  summary  we  concerned  ourselves  only  with  errors  in 
the  response  variable.  In  many  situations  we  don’t  have  a  preferred  variable  that  we  wish 
to  label  response,  but  would  still  like  to  summarize  the  joint  behaviour  of  x  and  y.  The 
dashed  line  in  figure  1.2a  shows  what  happens  if  we  used  x  as  the  response.  So  simply 
assigning  the  role  of  response  to  one  of  the  variables  could  lead  to  a  poor  summary.  An 
obvious  alternative  is  to  summarize  the  data  by  a  straight  line  that  treats  the  two  variables 
symmetrically.  The  first  principal  component  line  in  figure  1.2b  does  just  this  —  it  is  found 
by  minimizing  the  orthogonal  errors. 

Linear  regression  has  been  generalized  to  include  nonlinear  functions  of  x.  This  has 
been  achieved  using  predefined  parametric  functions,  and  more  recently  non-par  ametric 
ecatterplot  smoothers  such  as  kernel  smoothers,  (Gasser  and  Muller  1979),  nearest  neighbor 
smoothers,  (Cleveland  1979,  Friedman  and  Stuetzle  1981),  and  spline  smoothers  (Reinsch 
1967).  La  general  scatterploc  smoothers  produce  a  smooth  curve  that  attempts  to  minimize 
the  vertical  errors  as  depicted  in  figure  1.2c.  The  non-paramstric  versions  listed  above 
allow  the  data  to  dictate  the  form  of  the  non-linear  dependency. 

In  this  dissertation  we  consider  similar  generalizations  for  the  symmetric  situation. 
Instead  of  summarizing  the  data  with  a  straight  line,  we  use  a  smooth  curve;  in  finding  the 
curve  we  treat  the  two  variables  symmetrically.  Such  curves  will  pass  through  the  middle 

I 

of  the  data  in  a  smooth  way  without  restricting  smooth  to  mean  linear,  or  for  that  matter 
without  implying  that  the  middle  of  the  data  is  a  straight  line.  This  situation  is  depicted 
in  figure  1.2d.  The  figure  suggests  that  such  curves  minimize  the  orthogonal  distances  to 
the  points.  It  turns  out  that  for  a  suitable  definition  of  middle  this  is  indeed  the  case.  We 
name  them  Principal  Curve).  If,  however,  the  data  cloud  is  ellipsoidal  in  shape  then  one 
could  well  imagine  that  a  stiaight  line  passes  through  the  middle  of  the  cloud.  In  this  case 
we  expect  our  principal  curve  to  be  straight  as  well. 

The  principal  component  line  plays  roles  other  them  that  of  a  data  summary: 

•  In  errors  in  variables  regression  the  explanatory  variables  are  observed  with  error  (as 
well  as  the  rssponse).  T  lis  can  occur  in  practice  when  both  variables  are  measurements 
of  some  underlying  vari  ibles,  and  there  is  error  in  the  measurements.  It  occurs  in 
observational  studies  wi  ere  neither  variable  is  fixed  by  design.  U  the  aim  of  the  analysis 
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is  prediction  o'  y  or  regression  and  if  the  z  variable  is  never  observed  without  error,  then 
the  best  we  can  do  is  condition  on  the  observed  z’s  and  perform  the  standard  regression 
analysis  (Madansky  1959,  Kendall  and  Stuart  1961,  Lindley  1947).  If,  however,  we  do 
expect  to  observe  z  without  error  then  we  can  model  the  expectation  of  y  as  a  linear 
function  of  the  systematic  component  of  z.  After  suitably  scaling  the  variables,  this 
model  is  estimated  by  the  principal  component  line. 

•  Often  we  want  to  replace  a  number  of  highly  correlated  variables  by  a  single  vari¬ 
able,  such  as  a  normalized  linear  combination  of  the  original  set.  The  first  principal 
component  is  the  normalized  linear  combination  with  the  largest  variance. 

•  In  factor  analysis  we  model  the  syetematic  component  of  the  data  as  linear  combina¬ 
tions  of  a  small  subset  of  new  unobservable  variables  called  factors.  In  many  cases 
the  models  are  estimated  using  the  linear  principal  components  summary.  Variations 
of  this  model  have  appeared  in  many  different  forms  in  the  literature.  These  include 
linear  functional  and  structural  models,  errors  in  variables  and  total  least  squares. 
(Anderson  1983,  Golub  and  van  Loan  1979). 

In  the  same  spirit  we  propose  using  principal  curves  as  the  estimates  of  the  systematic 
components  in  non-linear  versions  of  the  models  mentioned  above.  This  broadens  the  scope 
and  use  of  such  curves  considerably.  This  dissertation  deals  with  the  definition,  description 
and  estimation  of  such  principal  curves,  which  are  more  generally  one  dimensional  curves 
in  p-space.  When  we  have  three  or  more  variables  we  «■  m  carry  the  generalizations  further. 
We  can  think  of  modeling  the  data  with  a  2  or  more  dimensional  surface  in  p  space.  Let  us 
first  consider  only  three  variables  and  a  2-surface,  and  deal  with  each  of  the  four  situations 
in  figure  1  2in  turn. 

•  If  one  of  the  variables  is  a  response  variable,  then  the  usual  linear  regression  model 
estimates  the  conditional  expectation  of  y  given  Z  =  (zi,Zj)  by  the  least  squares 
plane.  This  is  a  planar  response  surface  which  is  once  again  obtained  by  minimizing 
the  squared  errors  in  y.  These  errors  ve  the  vertical  distances  between  y  and  the  point 
on  the  plane  vertically  above  or  below  y. 

•  Often  a  linear  response  surface  does  not  adequately  model  the  conditional  expectation. 
We  then  turn  to  nonlinear  two  dimensional  response  surfaces  which  are  smooth  surfaces 
that  minimize  the  vertical  errors.  They  are  estimated  by  surface  smoothen  that  are 
direct  extensions  of  the  seatterplot  smoothers  for  curve  estimation. 
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Figure  1.2a  The  linear  regression  line  mini-  Figure  1.2b  The  principal  component  line 

mises  the  sum  of  squared  errors  in  the  response  minimises  the  sum  of  squared  errors  in  all  the 

variable.  variables. 


Figure  1.2c  The  smooth  regression  carve 
minimises  the  sura  of  squared  errors  in  the 
response  variable,  subject  to  smoothness  con 
straints. 


Figure  1.2d  The  principal  curve  minimizes 
the  sum  of  squared  errors  in  all  the  variables, 
subject  to  smooths ese  constraints. 
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•  If  all  the  variables  are  to  be  treated  symmetrically  the  principal  component  plane  passes 
through  the  data  in  such  a  way  that  the  sum  of  squared  distances  from  the  points  to 
the  plane  is  minimised.  This  in  turn  is  an  estimate  for  the  systematic  component  in  a 
2-dimensional  linear  model  for  the  mean  of  the  three  variables. 

•  Finally,  in  this  symmetric  situation,  it  is  often  unnatural  to  assume  that  the  best  two 
dimensional  summary  is  a  plane.  Principal  surfaces  are  smooth  surfaces  that  pass 
through  the  middle  of  the  data  cloud;  they  minimize  the  sum  of  squared  distances 
between  the  points  and  the  surface.  They  can  also  be  thought  of  as  a  an  estimate  for 
the  two  dimensional  systematic  component  for  the  means  of  the  three  variables. 

These  surfaces  are  easily  generalized  to  2-dimensional  surfaces  in  p  space,  although  they 
are  hard  to  visualize  for  p  >  3. 

The  dissertation  is  organized  as  follows: 

•  In  chapter  2  we  discuss  in  more  detail  the  linear  principal  components  model,  as  well . 
as  the  linear  relationship  model  hinted  at  above.  They  are  identical  in  many  cases, 
and  we  attempt  to  tie  them  together  in  the  situation?'  where  this  is  possible.  We  then 
propose  the  non-linear  generalizations. 

•  In  Chapter  3  we  define  principal  curves  and  surfaces  in  detail.  We  motivate  an  al¬ 
gorithm  for  estimating  such  models,  and  demonstrate  the  algorithm  using  simulated 
data  with  very  definite  and  difficult  structure. 

•  Chapter  4  is  theoretical  in  nature,  and  proves  some  of  the  claims  in  the  previous  chap¬ 
ters.  The  main  result  in  this  chapter  is  a  theorem  which  shows,  that  curves  that  pass 
through  the  middle  of  the  data  are  in  fact  critical  points  of  a  distance  function.  The 
principal  curve  and  surface  procedures  are  inherently  biased.  This  chapter  concludes 
with  a  discussion  of  the  various  forms  and  severity  of  this  bi&s. 

•  Chapter  S  deals  with  the  algorithms  in  detail  There  is  a  brief  discussion  of  scatterplot 
smoothers,  and  we  show  how  to  deal  with  the  problem  of  finding  the  cloeest  point  on 
the  curve.  The  algorithm  is  explained  l?y  means  of  simple  examples,  and  a  method  for 
span  selection  is  given. 

•  Chapter  6  contains  six  examples  of  the  use  and  abilities  of  the  procedures  using  real 
and  simulated  data.  Some  of  the  examples  introduce  special  features  of  the  procedures 
such  as  inference  using  the  bootstrap,  robust  options  and  outlier  detection. 
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•  Chapter  7  prov.  dee  a  discussion  of  related  work  in  the  literature,  and  gives  details  of 
some  of  the  more  recent  ideas.  This  is  followed  by  some  concluding  remarks  on  the 
work  covered  in  this  dissertation. 


Chapter  2 


Background  and  Motivation 

Consider  a  data  matrix  X  with  n  rows  aad  p  columns.  The  matrix  c oasis ta  of  n  points  or 
vectors  with  p  coordinates  In  many  situations  the  matrix  will  have  arisen  as  n  observations 
of  a  vector  random  variable. 

2.1.  Linear  Principal  Components. 

The  first  (linear)  principal  component  is  the  normalised  linear  combination  of  the  p  variables 
with  the  largest  sample  variance.  It  is  convenient  to  think  of  X  ss  a  cloud  of  n  points  in 
p-epaee.  The  principal  component  is  then  the  length  of  the  projection  of  the  n  points  onto 
a  direction  vector.  The  vector  is  chosen  so  that  the  variance  of  the  projected  points  along 
it  is  largest.  Any  lips  parallel  to  this  vector  will  have  the  same  property.  To  tie  it  down  we 
insist  that  it  passes  through  ths  mean  vector.  This  line  then  has  ths  appealing  property  of 
being  the  lias  in  p  specs  that  is  closest  to  ths  data.  Closest  is  in  terms  of  average  squared 
euclidian  distance.  We  think  of  ths  projection  as  being  ths  best  linear  one  dimensional 
summary  of  ths  data  X.  Of  course  this  linear  summary  might  be  totally  inadequate  locally 
but  it  attempts  to  provide  a  reasonable  global  summary. 

Ths  theory  and  practical  issues  involved  in  linear  principal  components  analysis  are 
waQ  known  (Barnett  1981,  Gnanadesikaa  1977),  and  ths  technique  is  originally  due  to 
Spearman  (1904),  aid  then  later  developed  by  Hotelling  (1933).  We  can  find  ths  the 
second  component,  orthogonal  to  tbs  first,  that  has  the  next  highest  variance.  The  plans 
spanned  by  the  two  vectors  aad  including  tbs  mean  vector  is  ths  plans  closest  to  ths  data, 
hi  general  we  can  find  tbs  m  <  p  dimensional  hyperplans  that  contains  ths  most  variance, 
and  is  closest  to  the  data. 

The  solution  to  the  problem  is  obtained  by  computing  ths  singular  value  decomposi¬ 
tion  or  basic  structure  of  X,  (centered  with  respect  to  the  sample  mean  vector),  or  equiva¬ 
lently  the  eigen  decomposition  of  the  sample  covariance  matrix  (Golub  and  Reinsch  1970, 
Greenaere  1984).  Without  any  lam  in  generality  ws  assume  from  now  on  that  X  is  centered. 
If  this  is  not  ths  csss,  we  can  center  X,  perform  the  analysis,  and  uncenter  the  results  by 
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adding  back  the  mean  vector. 

In  particular,  the  fijvt  principal  component  direction  vector  a  is  the  largest  normalized 
eigenvector  of  5,  the  saraple  covariance  matrix.  The  principal  component  itself  is  Xo,  an 
a  vector  with  elemenra  A,-  =  z*e  where  *J  is  the  ith  row  of  X  and  A,-  is  the  one  dimensional 
summary  variable  for  the  tth  observation.  The  coordinates  in  p-space  of  the  projection  of 
the  tth  observation  on  s  are  given  by  * 

=  ~'*i  (21) 

There  is  no  underlying  model  in  the  above.  We  merely  regard  the  first  component 
as  a  good  summary  of  the  original  variables  if  it  accounts  for  a  large  fraction  of  the  total 


2.2.  A  linear  model  formulation. 

hi  this  section  we  describe  a  linear  model  formulation  for  the  p  variables.  This  formulation 
includes  many  familiar  models  such  as  linear  regression  and  factor  analysis.  We  end  up 
showing  in  2.2.2  that  the  estimation  of  the  systematic  component  of  some  of  these  models 
is  once  again  the  principal  component  procedure. 

,  3.3.1.  Outline  of  the  linear  model. 

Consider  a  model  for  the  observed  data 

*<=«<  +  «,■  (2.2) 

where  %  is  an  unobservable  systematic  component  and  an  unobservable  random  com- 
.  poncnt  (We  only  get  to  see  their  sum).  We  usually  impose  some  linear  structure  on  u,-, 

namely 

■(seo  +  AA*  (2.3) 

where  «o  is  constant  location  vector,  A  is  a  px  m  matrix  and  A* is  an  m-vector.  For  the 
procedures  considered  «o  »  always  sstimatsd  by  tot  rumple  mean  vector  S;  without  loss  of 
generality  we  will  simply  assume  that  X  has  been  centered  and  ignore  the  term  «o>  We  also 

•  It  X  is  aot  entered  we  enter  it  by  forming  X  ■  X  -  in.  Then  the  prrcipal  component  is . 
A  m  Xa  sad  the  estimate  in  p  space  for  the  projectioa  of  the  ith  observation  onto  the  principal 
compoaeat  has  •  +  art  h  *  +  «A<  -  t  +  os' (a*  -  B) 
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omaraa  that  «i  are  mutually  independent  and  identically  distributed  random  vectors  with 
mean  0  and  covariance  matrix  9  and  are  independent  of  the  Aj. 

H  the  A  are  considered  to  be  random  as  well,  the  model  is  referred  to  as  the  linear 
structural  model,  or  more  commonly  as  the  factor  analysis  model.  If  the  A,  are  fixed  it  is 
referred  as  the  linear  functional  model.  The  model  (2.3)  includes  some  familiar  models  as 

special  cases: 

e  Let  A  be  p  x  (p-  1)  with  rank  (p  -  1).  We  can  write  A  as 

C) 

where  a  is  a  (p  —  1)  vector  and  I  is  (p  —  1)  x  (p  -  1)  since  we  can  post-multiply  A  by 
an  arbitrary  nco-singular  (p  -  1)  x  (p  —  1)  matrix  and  pre-multiply  A,  by  its  inverse. 
Thus  we  can  write  the  model  (2.3)  as 

where  Efa)  =  0  and  assume  Coefe)  =  diag(e£,  *£,...,**).  Ife§  =  o\  =  ...  =  —  0 

then  we  have  the  usual  linear  rsgrsarion  model  with  response  *u  and  regressor  variables 

e  If  the  variances  are  not  aero  we  have  the  errors  in  variables  regression  model.  The  idea 
is  to  find  a  (p-  1)  dimensional  hyperplane  in  p* space  that  approximates  the  data  well. 
The  model  takes  care  of  errors  in  all  the  variables,  whereas  the  usual  linear  regression 
model  considers  errors  only  in  the  response  variable.  This  is  a  form  of  linear  functional 

analysis. 

,  e  When  the  A<  are  random  we  have  the  usual  factor  analysis  model,  which  includes  the 
random  effects  Anova.  This  is  also  referred  to  as  the  linear  structural  model. 

e  If  all  the  variances  are  sero  and  the  A^  are  random  and  A  is  pxp  the  model  represents 
the  principal  component  change  of  basis.  In  this  situation  it  is  clear  that  the  A^  are 
each  functions  of  the  s<. 

For  a  hill  treatment  of  the  above  models  see  Anderson  (1982). 
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2.2.2.  Estimation 

W«  return  for  simplicity  to  the  case  where  m  =  1.  Thus 

*  =  aA,  +  4  (2.5) 

The  systematic  components  aA,  are  points  in  p-apace  confined  to  the  line  defined  by  a 
multiple  A,-  of  the  vector  a.  We  need  to  estimate  A,  for  each  observation,  and  the  direction 
vector. 

We  now  state  some  results  which  can  be  found  in  Anderson  (1982). 

If  either 

e  the  *i  are  jointly  Normal  with  a  scalar  covariance  el,  where  c  is  possibly  unknown, 
and  if  Aj  are  random  or  fixed,  and  ere  estimate  by  maximum  likelihood 
or 

e  se  above  but  we  drop  the  Normal  assumption  and  estimate  by  least  squares, 

then  the  estimate  of  A*  is  once  again  the  first  principal  component  and  that  of  e  the  principal 
component  direction  vector.  In  both  cases  the  quantity  we  wish  to  minimise  is 

*SS(A,s)=£||x,-«A<|!*. 

It  is  easy  to  see  that  for  any  n  the  appropriate  value  for  A < 
point  onto  a.  Thao  equation  (2.6)  reduces  to 

*SS(s)  m  £  H*  - 

»  tr  XX*  -  s'X'X* 

The  normalised  solution  to  (2.7)  is  the  largest  eigenvector  of  X*X. 

If  the  error  covariance  ♦  is  general  but  known,  we  can  transform  the  problem  to  the 
previous  case.  This  is  the  same  as  using  the  Mahalanobis  distance  defined  in  terms  of  ♦. 
In  particular  when  6*  is  diagonal  the  procedure  amounts  to  finding  the  line  that  minimises 
the  weighted  distance  to  the  points  and  is  depicted  in  figure  (2.1)  below.' 

If  the  error  covariance  is  unknown  and  not  scalar  then  we  require  replicate  observations 
in  order  to  estimate  it. 


(2.6) 

is  obtained  by  projecting  the 

(2.7) 


Figure  2.1  If  ♦  »=  diag  (<rf ,  <r|)  then  w*  minimis*  th«  weighted 
di*te nee  from  the  points  to  the  line. 


2.2.3.  Units  of  measurement. 


It  is  often  s  problem  in  multivariate  data  analysis  that  variables  have  different  error  vari¬ 
ances,  even  though  they  are  measured  in  the  same  units.  A  worse  situation  is  that  often 
the  variables  are  measured  in  completely  different  and  incommensurable  units.  When  we 
use  least  squares  to  estimate  a  lower  dimensional  summary,  we  explicitly  combine  the  errors 
on  each  variable  using  the  usual  stun  of  components  loss  function,  as  in  (2.6)  .  This  then 
gives  equal  weight  to  each  of  the  components.  The  solution  is  thus  not  invariant  to  changes 
in  the  scale  of  any  of  the  variables.  This  is  easily  demonstrated  by  considering  a  spherical 
point  cloud.  If  we  scale  up  one  of  the  co-ordinates  an  arbitrary  amount,  w*  can  create 
as  much  linear  structure  ss  we  like.  In  this  situation  we  would  really  like  to  weigh  the 
errors  in  the  estimation  of  our  model  according  to  the  variance  of  the  measurement  errors, 
which  is  seldom  known.  The  safest  procedure  in  this  situation  is  to  standardise  each  of  the 
coordinates  to  have  unit  variance.  This  could  destroy  some  of  the  structure  that  exists  but 
without  further  knowledge  about  the  scale  of  the  components  this  yields  a  procedure  that 
is  invariant  to  coordinate  scale  transformations. 

If,  on  the  other  hand,  it  is  known  that  the  variables  are  measured  in  the  same  units, 
we  should  not  do  any  sealing  at  all.  An  apparent  counter-example  occurs  if  we  make 
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measurements  of  the  same  quantities  in  different  situations,  with  different  measurement 
devices.  An  example  might  be  taking  seismic  readings  at  different  sights  at  the  same 
instances  with  different  recording  devices.  If  the  error  variances  of  the  two  devices  are 
different,  we  would  want  to  scale  the  components  differently. 

To  sum  up  so  far,  the  principal  component  summary,  besides  being  a  convenient  data 
reduction  technique,  provides  us  with  the  estimate  of  a  formal  parametric  linear  mode! 
which  coven  a  wide  variety  of  situations.  An  original  example  of  the  one  factor  model 
given  here  is  that  of  Spearman  (1904).  The  x.-  are  scores  on  psychological  tests  and  the  X < 
is  some  underlying  unobservable  general  intelligence  factor. 

Hie  estimation  in  all  the  cases  amounts  to  finding  a  m-dimenaional  hyperplane  in 
p-epace  that  is  closest  to  the  points  in  some  metric. 

2.3.  A  non-linear  generalization  of  the  linear  model. 

The  above  formulation  is  often  very  restrictive  in  that  it  assumes  that  the  systematic  com* 
ponent  in  (2.2)  is  linear,  as  in  (2.3).  It  is  true  in  some  cases  that  we  can  approximate  a 
nonlinear  surface  by  its  first  order  linear  component.  In  other  cases  we  do  not  have  sufficient 
data  to  estimate  any  more  than  a  linear  component.  Apart  from  these  cases,  it  is  more 
reasonable  to  assume  a  model  of  the  form 

*<  =  /(*)  +  «  (2-8) 

where  A*  is  a  m-vector  so  before  and  /  is  a  p- vector  of  functions,  each  with  m  arguments.  The 
functions  are  required  to  be  smooth  relative  to  the  errors.  This  is  a  natural  generalization 
of  the  linear  model. 

This  dissertation  deals  with  a  generalization  of  the  linear  principal  components.  In¬ 
stead  of  finding  lines  and  planes  that  come  does  to  the  data,,  we  find  curves  and  surfaces. 
Just  as  the  linear  principal  components  are  estimates  for  the  variety  of  linear  models  listed 
above,  so  will  our  non-  linear  versions  be  estimates  for  models  of  the  form  (2.8)  .  So  in 
addition  to  having  a  more  general  summary  of  multidimensional  data,  we  provide  a  means 
of  estimating  the  systematic  component  in  a  large  class  of  models  suitably  generalized  to 
include  no u- linearities.  We  refer  to  these  summaries  as  principal  curves  and  surfaces. 

So  far  the  discussion  has  concentrated  on  data  sets.  We  can  just  as  well  formulate  the 
above  models  for  p  dimensional  probability  distributions.  We  would  then  regard  the  data  set 
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m  a  sample  from  this  distribution  and  the  functions  derived  for  the  data  set  will  be  regarded 
as  estimates  of  the  corresponding  functions  defined  for  the  distribution.  These  models  then 
define  one  and  two  dimensional  surfaces  that  summarise  the  p  dimensional  distribution. 
The  point  /(A)  or  the  surface  that  corresponds  to  a  general  point  x  from  the  distribution 
is  a  p  dimensional  random  variable  that  can  be  summarised  by  a  two  dimensional  random 
variable  X. 

2.4.  Other  generalizations. 

There  have  a  number  of  generalisations  of  the  principal  component  model  suggested 
ip.  the  literature. 

•  “Generalised  principal  components*  usually  refers  to  the  adaptation  of  the  linear  model 
in  which  the  coordinates  are  first  transformed,  and  then  the  standard  principal  com¬ 
ponent  analysis  is  carried  out  on  the  transformed  coordinates. 

•  Multidimensional  scaling  (MDS)  finds  a  low  dimensional  representation  for  the  high 

dimensional  point  cloud,  such  that  the  sum  of  squared  interpoint  distances  are  pre¬ 
served.  This  constraint  has  been  modified  in  certain  cases  to  cater  only  for  points  that 
are  in  the  original  space. 

e  Prcxinuty  analysis  provides  parametric  representations  for  data  without  noise. 

•  Non  -linear  factor  analysis  is  a  generalization  similar  to  ours,  except  parametric  co¬ 
ordinate  functions  are  used. 

We  have  been  deliberately  brief  in  listing  these  alternatives.  Chapter  7  contains  a  detailed 
discussion  and  comparison  of  each  of  the  above  with  the  principal  curve  and  surface  models. 


Chapter  3 


The  Principal  Curve  and  Surface  models 


In  (hi*  chapter  we  define  the  principal  curve  and  surface  models,  first  for  a  p  dimensional 
probability  distribution,  and  then  for  a  p  dimensional  finite  data  set.  In  order  to  achieve 
some  continuity  in  the  presentation,  we  motivate  and  then  simply  state  results  and  theorems 
in  this  chapter,  and  prove  them  In  chapter  4. 

3.1.  The  principal  curves  of  a  probability  distribution. 

We  first  give  a  brief  introduction  to  one  dimensional  surfaces  or  curves,  and  then  define  the 
principal  curves  of  smooth  probability  distributions  in  p  space. 

3.1.1.  One  dimensional  curves. 

A  one  dimensional  curve  /  is  a  vector  of  functions  of  a  single  variable,  which  we  denote  by 
A.  These  functions  are  called  the  coordinate  functions,  and  X  provides  an  ordering  along 
the  curve.-  If  the  coordinate  functions  are  smooth,  then  /  will  be  a  smooth  curve.  We  can 
clearly  make  any  monotone  transformation  to  A,  say  m(A),  and  by  modifying  the  coordinate 
functions  appropriately  the  curve  remains  unchanged.  The  parametrization,  however,  is 
different.  There  is  a  natural  parametrization  for  curves  in  terms  of  the  arc-length.  The 
arc-length  of  a  curve  /  from  Ao  to  Aj  is  given  by 

l=  f  II/'WII dz- 

* 

If  ||/'(*)||  3  1  then  /  =  Aj— Ao.  This  is  a  rather  desirable  situation,  since  if  all  the  coordinate 
variables  are  in  the  asms  units  of  measurement,  then  A  is  also  in  those  units.  The  vector 
/*( A)  is  tangent  to  the  curve  at  A  and  is  sometimes  called  the  velocity  vector  at  A.  A  curve 
with  j|/*j)  3  1  is  called  a  unit  speed  parametrized  curve.  We  can  always  reparametrize  any 
smooth  curve  to  make  it  unit  speed.  If  *  is  a  unit  vector,  then  /(A)  —  ro  +•  A*  is  a  unit 
speed  ttreigkt  curve. 

The  vector  /"(A)  is  called  the  acceleration  of  the  curve  at  A,  and  for  a  unit  speed 
curve,  it  is  easy  to  check  that  it  is  orthogonal  to  the  tangent  vector,  hi  this  case  /*/  ||/*|| 
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Figure  (3.1)  The  radios  of  curvature  is  the  radius  of  the  circle 
tangent  to  the  curve  with  the  same  acceleration  as  the  curve. 

is  called  the  principal  normal  of  the  curve  at  A.  Since  the  acceleration  measures  the  rate 
and  direction  in  which  the  tangent  vector  turns,  it  is  not  surprising  that  the  curvature  of 
a  parametrized  curve  is  defined  in  terms  of  it.  The  easiest  way  to  think  of  curvature  is  in 
terms  of  a  circle.  We  fit  a  circle  tangent  to  the  curve  at  a  particular  point  and  lying  in  the 
plane  spanned  by  the  velocity  vector  and  the  principal  normal.  The  circle  is  constructed  to 
have  the  same  acceleration  as  the  curve,  and  the  radius  of  curvature  of  the  curve  at  that 
point  is  defined  as  the  radius  of  the  circle.  It  is  easy  to  check  that  for  a  unit  speed  curve 
we  get 

dcf 

ry( A)  s=  radius  of  curvature  of  /  at  A 

-vimi 

The  center  of  curvature  of  the  curve  at  A  is  denoted  by  e^(A)  and  is  the  center  of  this  circle. 

3.1.2.  Definition  of  principal  curves. 

We  now  define  what  we  mean  by  a  curve  that  passes  through  the  middle  of  the  data  —  what 
we  call  a  principal  curve.  Figure  3.2  represents  such  a  curve.  At  any  particular  location 
on  the  curve,  we  collect  all  the  points  in  p  space  that  have  that  location  as  their  closest 
point  on  the  curve.  Looeely  speaking,  we  collect  all  the  points  that  project  there.  Then 
the  location  on  the  curve  is  the  average  of  these  points.  Any  curve  that  has  this  property 
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Figure  (3.2)  Each  point  on  a  principal  curve  ia  the  average  of  th-j 
points  that  project  there. 

is  called  a  principal  curve.  One  might  say  that  principal  curves  are  their  own  conditional 
expectation.  We  will  prove  later  these  curves  axe  critical  points  of  a  distance  function,  as 
are  the  principal  components. 

In  the  figure  we  have  actually  shown  the  points  that  project  into  a  neighborhood  on 
the  curve.  We  do  this  because  usually  for  finite  data  sets  at  most  one  data  point  projects 
at  any  particular  spot  on  the  curve.  Notice  that  the  points  lie  in  a  segment  with  center  at 
the  center  of  curvature  of  the  arc  in  question.  We  will  discuss  this  phenomenon  in  more 
detail  in  the  section  on  bias  in  chapter  4. 

We  can  formalize  the  above  definition.  Suppose  X  is  a  random  vector  in  p-apace, 
with  continuous  probability  density  h{z).  Let  g  be  the  class  of  differentiable  1-dimensional 
curves  in  1R*,  parametrized  by  A.  In  addition  we  do  not  allow  curves  that  form  closed  bops, 
so  they  may  not  intersect  themselves  or  be  tangent  to  themselves.  Suppose  A  €  Af  for  each 
/  in  Q.  For  /  €  $  and  z  6  1RP,  we  define  the  projection  index  \j  :  JR*  *-*  by 

A,(*)  =  max{A  :  ||*  - /(A)||  =  inf  I),  - /(M)j|). 


(3.1) 
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The  projection  index  A j{%)  of  x  is  the  value  of  A  for  which  /(A)  is  closest  to  x.  There  might 
be  a  number  of  such  points  (suppose  /  is  a  circle  and  x  is  at  the  center),  so  we  pick  the 
largest  such  value  of  A.  We  will  show  in  chapter  4  that  A^(x)  is  a  measureable  mapping 
from  R*  to  R1,  and  thus  A j(X)  is  a  random  variable. 

Definition 

The  Principal  Curves  of  h  are  those  members  of  Q  which  axe  self  consistent.  A  curve  /  €  Q 
is  self  consistent  if 

B(X  |A/(.X)  =  A)  =  /(A)  VA  €  A; 

We  call  the  class  of  principal  curves  T(h). 

3.1.3.  Existence  of  principal  curves. 

An  immediate  question  might  be  whether  such  curves  exist  or  not,  and  for  what  kinds  of 
distributions.  It  is  easy  to  check  that  for  ellipsoidal  distributions,  the  principal  components 
are  in  fact  principal  curves.  For  a  spherically  symmetric  distribution,  any  line  through  the 
mean  vector  b  a  principal  curve. 

What  about  data  generated  from  a  model  as  in  equation  2.8,  where  A,-  is  1  dimensional? 
Is  /  a  principal  curve  for  this  distribution?  The  answer  in  general  is  no.  Before  we  even 
try  to  answer  it,  we  have  to  enquire  about  the  distribution  of  A,-  and  e,-.  Suppose  that  the 
data  is  well  behaved  in  that  the  distribution  of  e,-  has  tight  enough  support,  so  that  no 
points  can  fall  beyond  the  centers  of  curvature  of  /.  This  guarantees  that  each  point  has 
a  unique  closest  point  to  the  curve.  We  show  in  the  next  chapter  that  even  under  these 
ideal  conditions  (spherically  symmetric  errors,  slowly  changing  curvature)  the  average  of 
points  that  project  at  a  particular  point  on  the  curve  from  which  they  are  generated  lies 
outside  the  circle  of  curvature  at  that  point  on  the  curve.  This  means  that  the  principal 
curve  will  be  different  from  the  generating  curve.  So  in  this  situation  an  unbiased  estimate 
of  the  principal  curve  will  be  a  biased  estimate  of  the  functional  model.  This  bias,  he  <sr, 
is  small  and  decreases  to  zero  as  the  variance  of  the  errors  gets  small  relative  to  the  radius 
of  curvature. 

3.1.4.  The  distance  property  of  principal  curves. 

The  principal  components  are  critical  points  of  the  squared  distance  from  the  points  to  their 
projections  on  straight  curves  (lines).  Is  there  any  analogous  property  for  principal  curves? 
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It  turns  out  that  there  is.  Let  d(x,  f)  denote  the  usual  euclidian  distance  from  a  point  x  to 
its  projection  on  the  curve  /: 

d(*,/)<£||z-/(A/(*))||  (3.2) 

and  define  the  function  D*  :  $  — *  1R1  by 

D></)  *  EJ*(X,/). 

We  show  that  if  we  restrict  the  curves  to  be  straight  lines,  then  the  principal  components 
are  the  only  critical  values  of  2J*(/).  Critical  value  here  is  in  the  variational  sense:  if  /  and 
g  are  straight  lines  and  we  form  /<  =  /  +  eg,  then  we  define  /  to  be  a  critical  value  of  D2  iff 

dD\U)fde  j<=0  =  0. 

This  means  that  they  are  minima,  maxima  or  saddle  points  of  this  distance  function.  If  we 
restrict  /  and  g  to  be  members  of  the  subset  of  Q  of  curves  defined  bn  a  compact  A,  then 
principal  curves  have  this  property  as  well.  In  this  case  /«  describes  a  class  of  curves  about 
/  that  shrink  in  as  e  gets  small.  The  corresponding  result  is:  dD2(ft)/de\t=o  =  0  iff  /  is 
a  principal  curve  of  h.  This  is  a  key  property  and  is  an  essential  link  to  all  the  previous 
models  and  motivation  in  chapter  2.  This  property  is  similar  to  that  enjoyed  by  conditional 
expectations  or  projections;  the  residual  distance  is  minimized.  Figure  (3.3)  illustrates  the 
idea,  and  in  fact  is  almost  a  proof  in  one  direction. 

Suppose  k  is  not  a  principal  curve.  Then  the  curve  defined  by  /(A)  =  E(X  |  Afc(AC)  = 
A)  certainly  gets  closer  to  the  points  in  any  of  the  neighborhoods  than  the  original  curve. 
This  is  the  property  of  conditional  expectation.  Now  the  points  in  any  neighborhood  defined 
by  A^  might  end  up  in  different  neighborhoods  when  projected  onto  /,  but  this  reduces  the 
distances  even  further.  This  shows  that  k  cannot  be  a  critical  value  of  the  distance  function. 

An  immediate  consequence  of  these  two  results  is  that  if  a  principal  curve  is  a  straight 
line,  then  i.‘  Ju  a  principal  component.  Another  result  is  that  principal  components  are  self 
consistent  if  we  replace  conditional  expectations  by  linear  project  'ns. 

3.1.4.1  A  smooth  subset  of  principal  curves. 

We  have  defined  principal  curves  in  a  rather  general  fashion  without  any  smoothness  re¬ 
strictions.  The  distance  theorem  tells  us  that  if  we  have  a  principal  curve,  we  will  not  find 
any  curves  nearby  with  the  same  expected  distance.  We  have  a  mental  image  of  what  we 
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Figure  3.3  Th«  conditional  expectation  curve  gets  at  leaat  aa  cloee 
to  the  points  an  the  original  curve. 

would  like  the  carves  to  look  like.  They  should  pass  through  the  date  smoothly  enough  so 
that  each  data  point  has  an  unambiguous  closest  point  on  the  curve.  This  smoothness  will 
be  dictated  by  the  density  h.  It  turns  out  that  we  can  neatly  summarize  this  requirement. 
Consider  the  subset  Tc(h)  C  T(h)  of  principal  curves  of  A,  where  /  €  Te(h)  iff  /  €  T[h) 
and  Xj(x)  is  continuous  in  x  for  all  points  x  in  the  support  of  h.  In  words  this  says  that  if 
two  points  x  and  y  are  close  together,  then  their  points  of  projection  on  the  curve  are  dose 
together.  This  has  a  number  of  implications,  si' me  of  which  are  obvious,  which  we  will  list 
now  and  prove  later. 

•  There  is  only  one  closest  point  on  the  principal  curve  for,  each  x  in  the  support  of  h. 

•  The  curve  is  globally  well  behaved.  This  means  that  the  curve  cannot  bend  back  add 
come  too  doss  to  itaelf  since  that  will  lead  to  ambiguities  in  projection.  (If  we  want 
to  deal  with  closed  curves,  such  aa  a  circle,  a  technical  modification  in  the  definition 
of  A  is  required). 

•  There  are  no  points  at  or  beyond  the  centers  of  curvature  of  the  curve.  This  says  that 
the  curve  is  smooth  relative  to  the  variance  of  the  data  about  the  curve.  This  has 
intuitive  appeal.  If  the  data  is  very  noisy,  we  cannot  hope  to  recover  more  than  a  very 
smooth  curve  (nearly  a  straight  line)  from  it. 
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Figure  3.4  TV*  coat  amity  coutniat  avoid*  global  ambigmitias  (a) 
ud  local  ambtgmiti**  (b)  ia  projtctia*. 

Figure  3.4  illustrate*  tb*  way  in  which  th«  continuity  constraint  avoids  global  and  local 
ambiguities.  Notice  that  T,(h)  depends  on  the  density  h  of  X.  W#  say  is  the  support  of 
h,  but  if  the  errors  have  an  infinite  range,  this  definition  would  only  allow  straight  lines. 
We  can  make  some  technical  modifications  to  overcome  this  hurdle,  such  as  insisting  that  h 
has  compact  support.  This  rules  out  any  theoretical  consideration  of  curves  with  gauasian 
errors,  although  in  practice  we  always  have  compact  support.  Nevertheless,  the  class  fe(h) 
will  prove  to  be  useful  is  understanding  so  ms  of  the  properties  of  principal  curves. 

3.2.  The  principal  surfaces  of  a  probability  distribution. 

4.2.1.  Two  dimensional  surfaces. 

The  level  of  difficulty  increases  dramatically  as  ws  move  from  on*  dimensional  surfaces  or 
curves  to  higher  dimensional  surface*,  fa  this  work  we  will  only  deal  with  2-dimensional  sur¬ 
faces  in  p  space.  In  fact  we  shall  deal  only  with  2-surfaces  that  admit  a  global  parametriza- 
tion.  rhi#  allows  us  to  define  /  to  be  a  smooth  2-dimensional  globally  parametrised  surface 


Chapter  S:  The  Principal  Curve  and  Surface  models  21 


if  f  :  A*-*  R*  for  i  C  R*  ii  a  vector  of  smooth  functions: 

ihW\ 


\f,WJ 

r/l(Al,A1)'' 

MAl.A,) 

k./s(Ai,Aj)> 


(3.3) 


Another  way  of  defining  s  2-eurface  in  p  space  is  to  have  p  —  2  constraints  on  the  p  coordi¬ 
nates.  An  example  is  the  unit  sphere  in  R*.  It  can  be  defined  as  {x  :  *  6  Ra,  ||x||  =  1}. 
Thera  is  one  constraint.  We  will  call  this  the  tmytictt  definition. 

Not  all  2-eurfaeas  have  implicit  definitions  (mobius  band),  and  similarly  not  all  surfaces 
have  global  parametria  aliens.  However,  locally  an  equivalence  can  be  established  (Thorpe 
1978). 

The  concept  of  arc-length  generalises  to  surface  area.  However,  we  cannot  always  re- 
paramsthse  the  surface  so  that  units  of  area  in  the  parameter  space  correspond  to  units  of 
area  in  the  surface.  Ones  again,  local  parametrizaiioos  do  permit  this  change  of  units. 


Curvature  also  takes  on  another  dimension.  The  curvature  of  a  surface  at  any  point 
might  be  different  depending  on  which  direction  we  look  from.  The  way  this  is  resolved 
is  to  look  from  all  possible  directions,  and  the  first  principal  curvature  is  the  curvature 
corresponding  to  the  direction  in  ehich  the  curvatuie  is  greatest.  The  second  principal 
curvature  corresponds  to  the  largest  curvature  in  a  d  xection  orthogonal  to  the  first.  For 
2-surfaces  there  are  only  two  orthogcnal  directions,  so  we  are  done. 


S.2.2.  Definition  of  principal  surfaces. 


Ones  again  let  X  be  a  random  vector  in  p-epace,  with 
Let  $*  be  the  class  of  differentiable  2-dimensional  surfi 
a  2-dimensional  parameter  vector. 

For  f  €  $*  and  a  €  R',  we  define  the  projection 


continuous  probability  density  h(x). 
jaces  in  R',  parametrised  by  A  €  Ay, 


index  Ay(z)  by 


A /(•)  *  m»*max{A :  ||s  -  /(A)||  inf  jj*  -  /(p)||). 


(3.4) 
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Tb«  projection  index  define*  the  closest  point  on  the  surface;  if  there  is  more  than  one,  it 
picks  the  one  with  the  largest  first  component.  If  this  is  still  not  unique,  it  then  maximizes 
ower  the  second  component.  Once  again  A  j[x)  is  a  measure  able  mapping  from  Rp  into  1R*, 
and  A f(X)  is  a  random  sector. 

Definition 

Tbs  Principal  Surf  acme  of  A  are  those  members  of  which  are  self  consistent: 

1(X|A,(X)  =  A)  =  /(A) 

Figure  (3.5)  demonstrates  the  situation. 


Figure  3.5  Each  poiat  os  a  principal  surface  is  the  average  of  the 
poiate  that  project  there. 

The  plan*  spanned  by  the  first  and  second  principal  components  minimizes  the  distance 
from  the  points  to  their  projections  onto  any  plane.  Once  again  let  d(x,  f)  denote  the  usual 
euclidian  distance  from  a  point  s  to  its  projection  on  the  surface  /,  and  Z?*(/)  =  E«f*(X,  /). 
If  the  surfaces  are  restricted  to  be  planes,  then  the  planes  spanned  by  any  pair  of  principal 
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component*  are  the  only  critical  values  of  £*(/)■  There  is  a  result  analogous  to  the  one 
to  be  proven  for  principal  curves.  If  we  restrict  /  to  be  the  members  of  §*  defined  on 
connected  compact  sets  in  R*,  then  the  principal  surfaces  of  h  are  the  only  critical  values 
° (D\f). 

Let  7*{h)  C  G *  denote  the  class  of  principal  2-surfaces  of  k.  Once  again  we  consider  a 
smooth  subset  of  this  class.  Form  the  subset  72(h)  C  77(h),  where  /  e  7}  (h)  iff  /  €  77(h) 
and  \f(  x)  is  continuous  in  x  for  all  points  s  in  the  support  of  h.  Surfaces  in  72(h)  have 
the  following  properties. 

•  There  is  only  one  closest  point  on  the  principal  surface  for  each  x  in  the  support  of  h. 

•  The  surface  is  globally  well  behaved,  in  that  it  cannot  fold  back  upon  itself  causing 
ambiguities  in  projection. 

•  We  saw  that  for  principal  curves  in  7,(h),  there  are  no  points  at  or  beyond  the  centers 
of  curvature  of  the  curve.  The  analogous  statement  for  principal  surfaces  in  72(h)  is 
that  there  are  no  points  at  or  beyond  the  centers  of  normal  curvature  of  any  unit  speed 
curve  in  the  surface. 

3.3.  An  algorithm  for  finding  principal  curves  and  surfaces. 

We  are  still  in  the  theoretical  situation  of  finding  principal  curves  or  surfaces  for  a  probability 
distribution.  We  will  refer  to  curves  (1-dimensional  surfaces)  and  2-dimensional  surfaces 
jointly  as  surfaces  in  situations  where  the  distinction  is  not  important.  , 

When  seeking  principal  surfaces  or  critical  values  of  D7(f),  it  is  natural  to  look  for  a 
smooth  curve  that  corresponds  to  a  local  minimum.  Our  strategy  is  to  start  with  a  smooth 
curve  and  then  to  lode  around  it  for  a  local  minimum.  Recall  that 

!>»(/)=  ■|x-/(A,(Jf))|*  (3.5) 

-  *A/(X)*[|^--/(A/W)ir  !*/(*>]•  (*■•«) 

We  can  writ*  this  as  a  minimisation  problem  in  f  and  A:  find  /  and  A  such  that 

D’(/,A)=  E||X-/(1)||*  (3.7) 

is  a  minimum.  Clearly,  given  any  candidate  solution  /  and  A,  /  and  A j  is  at  least  as  good. 
Two  key  ideas  emerge  from  this: 
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•  If  we  knew  /  a*  a  function  of  A,  then  we  could  minimise  (3.7)  by  picking  A  =  A j(z) 
at  each  point  s  in  the  rapport  of  A. 

•  Suppose,  on  the  o' Iter  hand,  that  we  had  a  function  A(z).  We  could  rewrite  (3  7)  aa: 

#(/,*)  =  *X(X)  E  ■[(*>  -  /i(A(X))*  |A(X)1  (3.8) 

We  could  minimise  D\  by  choosing  each  fj  separately  so  aa  to  minimise  the  corre¬ 
sponding  term  in  the  sum  in  (3.8)  .  This  amounts  to  choosing 

/j(A)  =  E(X,|A(X)  =  A).  (3.9) 

In  this  last  step  we  have  to  check  that  the  new  /  is  differentiable.  One  can  construct  many 
situations  where  this  is  not  the  case  by  allowing  the  starting  curve  to  be  globally  wild.  On 
the  other  hand,  if  the  starting  curve  is  well  behaved,  the  sets  of  projection  at  a  particular 
point  in  the  curve  or  surface  lie  in  the  normal  hyperplanes  which  vary  smoothly.  Since  the 
density  h  is  smooth  we  can  expect  that  the  conditional  expectation  in  (3.9)  will  define  a 
smooth  function.  We  give  more  details  in  the  next  chapter.  The  above  preamble  motivates 
the  following  iterative  algorithm. 

Principal  surface  algorithm 

Initialisation:  Set  /^(A)  =  AA  where  A  is  either  a  column  vector  (principal 
curves)  and  is  the  direction  vector  of  the  first  linear  principal 
component  ofAorAisapx2  matrix  (principal  surfaces)  con¬ 
sisting  of  the  first  two  principal  component  direction  vectors. 

Set  A<°>  =  A/w. 

repeat:  over  iteration  counter  j 

1)  Set  /W(.)  «  E(X  |Atf-**)(X)  =  •). 

2)  Chooee  AW  =  A/y). 

3)  Evaluate  0*W  =*  0j(/W,iW). 
until:  D*  W  fails  to  decrease. 

Although  we  start  with  the  linear  principal  component  solution,  any  reasonable  starting 
values  can  be  used. 
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It  is  eujr  to  check  thst  the  criterion  D 3  (*)  must  converge.  It  is  positive  and  bounded 
below  by  0.  Suppose  we  have  /k'-1)  and  A^-1).  Now  DftfW,  A^'-1))  <  Dj(/k’-l),Ak_1)) 
by  the  propertiee  ai conditional  expectation.  Also  \M)  <  D*(/W, A^-1^)  since  the 

ACr)  are  chosen  that  way.  Thus  each  step  of  the  iteration  is  a  decrease,  and  the  criterion 
converges.  This  does  not  mean  that  the  procedure  has  converged,  since  it  is  conceivable  that 
the  algorithm  oec  ills  tea  between  two  or  more  curves  that  are  the  same  expected  distance 
from  the  points.  We  have  not  found  an  example  of  this  phenomenon. 

The  definition  of  principal  surfaces  is  suggestive  of  the  above  algorithm.  We  want  a 
smooth  surface  that  is  self  consistent.  So  we  start  with  the  plane  (line).  We  then  check 
if  it  is  indeed  self  consistent  by  evaluating  the  conditional  expectation.  If  not  we  have  a 
surface  as  a  by-product.  We  then  check  if  this  is  self  consistent,  and  so  on.  Once  the  self 
consistency  condition  is  met,  we  have  a  principal  surface.  By  the  theorem  quoted  above, 
this  surface  is  a  critical  point  of  the  distance  function. 

3.4.  Principal  curves  and  surfaces  for  data  sets. 

So  far  ws  have  considered  the  principal  curves  and  surfaces  for  a  continuous  multivariate 
probability  distribution,  hi  reality,  we  usually  have  a  finite  multivariate  data  set.  How  do 
we  define  the  principal  curves  and  surfaces  for  them?  Suppose  then  that  X  is  a  n  x  p  matrix 
of  ft  observations  on  p  variables.  We  regard  the  data  set  as  a  sample  from  an  underlying 
probability  distribution,  and  use  it  to  estimate  the  principal  curves  and  surfaces  of  that 
distribut  in.  We  briefly  deecribe  the  ideas  hers  and  leave  the  details  for  chapters  5  and  6. 

• 

e  The  first  step  in  the  algorithm  uses  linear  principal  components  as  starting  values. 
We  use  the  sample  principal  components  and  their  corresponding  direction  vectors  as 
initial  estimates  of  A j  and  /(°). 

•  Given  functions  fV~l)  we  can  find  for  each  *i  in  the  sample  a  value  A^-1^  =  A (*,•). 
This  can  be  done  in  a  number  of  ways,  using  numerical  optimization  techniques.  In 
practice  we  have  evaluated  at  n  values  of  A,  in  fact  at  A^“,),.--,A«",). 

is  evaluated  at  other  points  by  interpolation.  To  illustrate  the  idea  let  us  con¬ 
sider  a  curve  for  which  we  have  fV~l)  evaluated  at  for  i  =  l,  --  ,n.  For  each 

point  i  in  the  sample  we  can  project  Zj  onto  the  line  joining  each  pair 

Suppose  the  distance  to  the  projection  is  d,-*,  and  if  the  point  projects 
beyond  either  endpoint,  then  4*  is  the  distance  to  the  closest  endpoint.  Correspond¬ 
ing  to  each  4*  is  a  value  A,*  €  We  then  let  X^~l)  be  the  A,-*  that 
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corresponds  to  the  smallest  value  of  da,.  This  is  an  0(nl)  procedure,  and  as  such  is 
rather  naive.  We  use  it  as  an  illustration  and  will  describe  more  efficient  algorithms 
later. 


We  have  to  estimate  /W( A)  =  Z(X  |  *1  =  A).  We  restrict  ourselves  to  estimating 

this  quantity  at  only  n  values  of  A^-1),  namely  A  [7-1\  •  •  • ,  A«  ~l^which  we  have  already 
estimated.  We  require  E(X  |A^~^  =  A^-1^).  This  says  that  we  have  to  gather  all 
the  observations  that  project  onto  /O'"1)  at  A,-7-1^,  and  find  their  mean.  Typically 
we  have  only  one  such  observation,  namely  x,.  It  is  at  this  stage  that  we  introduce 
the  scs tterplot  tmootker,  the  fundamental  building  block  in  the  principal  curve  and 
surface  procedures  for  finite  data  sets.  We  estimate  the  conditional  expectation  at 
Aj7-1)  by  averaging  all  the  observations  Xj,  in  the  sample  for  which  a[7-1^  is  close  to 
As  long  as  these  observations  are  close  enough  and  the  underlying  density  is 
smooth,  the  bias  introduced  will  be  small.  On  the  other  hand,  the  variance  of  the 
— decreases  as  we  include  more  observations  in  the  neighborhood.  Figure  (3.6) 
demonstrates  this  local  averaging.  Once  again  we  have  just  given  the  ideas  here,  and 
will  go  into  details  in  later  chapters. 


Figure  3.8  W«  estimate  the  conditional  expectation 
E(X  |  A^-1)  =  by  averaging  the  observations  s*  for  which 

is  close  to 
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•  One  property  of  scatterplot  smoothers  in  general  is  that  they  produce  smooth  curves 
and  surfaces  as  output.  The  larger  the  neighborhood  used  for  averaging,  the  smoother 
the  output.  Since  we  are  trying  to  estimate  differentiable  curves  and  surfaces,  it  is 
convenient  that  our  algorithm,  in  seeking  a  conditional  expectation  estimate,  does 
produce  smooth  estimates.  We  will  have  to  worry  about  how  smooth  these  estimates 
should  be,  or  rather  how  big  to  make  the  neighborhoods.  This  becomes  a  variance 
versus  bias  tradeoff,  a  familiar  issue  in  non-par ametric  regression. 

•  Finally,  we  estimate  D1  W  in  the  obvious  way,  by  adding  up  the  distances  of  each  point 
in  the  sample  from  the  current  curve  or  surface. 

3.5.  Demonstrations  of  the  procedures. 

We  look  at  two  examples,  one  for  curves  and  one  for  surfaces.  They  both  are  generated 
from  an  underlying  true  model  so  that  we  can  easily  check  tnat  the  procedures  are  doing 
the  correct  thing. 

3.5.1.  The  circle  in  two-space. 

The  aeries  of  plots  in  figure  3.7  show  100  data  points  generated  from  a  circle  in  2  dimensions 
with  independent  Gaussian  errors  in  both  coordinates.  In  fact,  the  generating  functions  are 

GMSM;) 

where  A  is  uniformly  distributed  on  [0, 2x]  and  e%  and  «]  are  independent  M(0, 1). 

The  solid  curve  in  each  picture  is  the  estimated  curve  for  the  iteration  as  labelled,  and 
the  dashed  curve  is  the  true  function.  The  starting  curve  is  the  first  principal  component, 
in  figure  3.7b.  Figure  3.7a  gives  the  usual  scatterplot  smooth  of  xj  against  'xj,  which  is 
clearly  an  inappropriate  summary  for  this  constructed  data  set. 

The  curve  in  figure  3.7k  does  substantially  better  than  the  previous  iterations.  The 
figure  caption  gives  us  a  clue  why  —  the  span  of  the  smoother  is  reduced.  This  means  that 
the  sise  of  the  neighborhood  used  for  local  averaging  is  smaller.  We  will  see  in  the  next 
chapter  how  the  bias  in  the  curves  depends  on  this  span. 

The  square  root  of  the  average  squared  orthogonal  distance  is  displayed  at  each  iter¬ 
ation.  If  the  true  curve  was  linear  the  expected  orthogonal  distance  for  any  point  would 
be  y/  Ex?  =  I-  We  will  see  in  chapter  4  that  for  this  situation,  the  true  circle  does  not 
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Figure  s.7a  The  dashed  curve  is  the  usual  Figure  3.7b  The  dashed  curve  is  the 
scatterplot  smooth.  U(S)  =■  3.35  principal  component  line.  =  3.43 


•  s': 


Figure  3.7g  £>(?<«)  =  2.25  Figure  3. 7b  £>(><«>)  =  i.9l 


•*4», 
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minimise  the  distance,  but  rather  a  circle  with  slightly  larger  radius.  Then  the  minirrizing 
distance  is  approximately  <r*(l  —  1/4 p2)  =  .99.  Our  final  distance  is  even  lower.  We  still 
have  to  adjust  for  the  overfit  factor  or  number  of  parameters  used  up  in  the  fitting  proce¬ 
dure.  This  deflation  factor  is  of  the  order  nj (n  —  q)  where  q  is  the  number  of  parameters. 
In  linear  principal  components  we  know  q.  In  chapter  6  we  suggest  some  rule  of  thumb 
approximations  for  q  in  this  non- parametric  setting. 

This  example  presents  the  principal  curve  procedure  with  a  particularly  tough  job. 
The  starting  value  is  wholly  inappropriate  and  the  projection  of  the  points  onto  this  line 
does  not  nearly  represent  the  final  ordering  of  the  points  projected  onto  the  solution  curve. 
At  each  iteration  the  coordinate  system  for  the  A^  is  transferred  from  the  previous  curve 
to  the  current  curve.  Points  initially  project  in  a  certain  order  on  the  starting  vector,  as 
depicted  in  figure  3.8a.  The  new  curve  is  a  function  of  A(°)  measured  along  this  vector 
as  in  figure  3.8b  obtained  by  averaging  the  coordinates  of  points  local  in  A^.  The  new 
At1)  values  are  found  by  projecting  the  points  onto  the  new  curve.  It  can  be  seen  that  the 
ordering  of  the  projected  points  along  the  new  curve  can  be  very  different  to  the  ordering 
along  the  previous  curve.  This  enables  the  successive  curves  to  bend  to  shapes  that  could 
not  be  parametrised  in  the  original  principal  component  coordinate  system. 

3.5.3.  The  half-spbare  in  three-space. 

Figure  3.9  shows  150  points  generated  from  the  surface  of  the  half-sphere  in  3-D.  The 
simulated  model  in  polar  co-ordinates  is 

/*l\  / 5 sin(Ai) cos(Aj)  \  /ei\ 

I  *j  I  =  I  5cos(Ai)  cos(Aj)  I  +  ej  *  (3.11) 

\*3^  5sin(Aj)  )  \es) 

for  Ai  6  [0,2*]  and  Aj  €  [0,  jt/2).  The  vector  e  of  errors  is  simulated  from  a  M(0, 7) 
distribution,  and  the  values  of  Ai  and  A;  are  chosen  so  that  the  points  are  distributed 
uniformly  in  the  surface.  Figure  3.9a  shows  the  data  and  the  generating  surface.  The 
expected  distance  of  the  points  from  the  generating  half-sphere  is  to  first  order  1,  which  is 
the  expected  squared  length  of  the  residual  when  projecting  a  spherical  standard  gaussian 
3-vector  onto  a  plane  through  the  origin.  Ideally  we  would  display  this  example  on  a  motion 
graphics  workstation  in  order  to  see  the  3  dimensions.* 

*  This  dissertation  is  accompanied  by  a  motion  graphics  movie  called  Principal  Curvet  and 
Surfaces.  The  half-sphere  is  one  of  4  examples  demonstrated  in  the  movie. 


Figure  3.8  The  curve  of  the  the  first  iteration  is  a  function  of  A(°) 
measured  along  the  starting  vector  (a).  The  curve  of  the  the  second 
iteration  is  a  function  of  A^1)  measured  along  the  curve  of  the  first 
iteration  (b). 

3.6.  Principal  surfaces  and  principal  components. 

In  thin  section  we  draw  some  comparisons  between  the  principal  curve  and  surface  models 
and  their  linear  counterparts  in  addition  to  those  already  mentioned. 

3.6.1.  A  Variance  decomposition. 

Usually  linear  principal  components  are  approached  via  variance  considerations.  The  first 
component  is  that  linear  combination  of  the  variables  with  the  largest  variance.  The  second 
component  is  uncorrelated  with  the  first  and  has  largest  variance  subject  to  this  constraint. 
Another  way  of  saying  this  is  that  the  total  variance  in  the  plane  spanned  by  ths  first  two 
components  is  larger  than  that  in  any  other  plane.  By  total  variance  we  mean  the  sum  of 
the  variances  of  the  data  projected  onto  any  orthonormal  basis  of  the  subspaee  defined  by 
the  plane.  The  following  treatment  is  for  one  component,  but  the  ideas  easily  generalize  to 


two. 
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Pigurt  3.9a.  Th«  g«a«nting  rarfa*  and  Figure  3.9b.  Tfc.  principal  component 

ta*  d«u.  D(S)  *  1.0  pUn«.  £>(?«»)  „  1.59 


Figure  3.9c.  -  1.20 


Figure  3.9d.  ** 0.78 
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If  X  —  [Xi,...tXny  in  the  first  principal  component  of  X,  a  n  x  p  data  matrix,  and 
•  is  the  corresponding  direction  vector,  then  the  following  variance  decomposition  is  easily 
derived: 

53  Var(zj)  =  Var(A)  +  E  ||«  -  -A|{*  (3.12) 

j«=l 

where  Var  (•)  and  JE(-)  refer  to  sample  variance  and  expectation.  If  the  principal  component 
was  defined  in  the  parent  population  then  the  result  is  still  true  and  Var(-)  and  E(  )  have 
their  usual  meaning.  The  second  term  on  the  right  of  (3.12)  is  the  expected  squared 
distance  of  a  point  to  its  projection  onto  the  principal  direction.* 

The  total  variance  in  the  original  p  variables  is  decomposed  into  two  components:  the 
variance  explained  by  the  linear  projection  and  the  residual  variance  in  the  distances  from 
the  points  to  their  projections.  We  would  like  to  have  a  similar  decomposition  for  principal 
curves  and  surfaces. 


Let  w  now  be  any  random  variable.  Standard  results  on  conditional  expectation  show 

that: 

53  Vmr(xi)  -  53  E(x> "  *(*»  i*))* + 53  Vmr(  ®(*i  !*))•  (*•**) 

j*l  jmt  j*  t 

Ifvs  Ay(x)  and  /  is  a  principal  cun'e  so  that  I  (*,•  |Aj(x))  =s  fj(Xf(x))t  we  have 


53  v"(xf)  = 


*  -  /(»/(*))!’+  £  »„(/,(!,(*))).  (3.H) 


This  gives  us  an  analogous  result  to  (3.12)  in  the  distributional  case.  That  is,  the  total 
variance  in  the  p  coordinates  is  decomposed  into  the  variance  explained  by  the  true  curve 
and  the  residual  variance  in  the  expected  squared  distance  from  a  point  to  its  true  position 
on  the  curve.  The  sample  version  of  (3.14)  holds  only  approximately: 

£  Var(*,)  -  £  1*  -  /(A,)|f  +  £  Var (/,(*,)).  (3.15) 

<»i  j«t 

The  reason  for  this  is  that  most  practical  scatterplot  smoothers  are  not  projections,  whereas 
conditional  expectations  are. 

We  make  the  following  observations: 

*  W«  keep  ia  miad  that  X  is  considered  to  be  centered,  or  alteraativty  that  E(x)  ■»  0.  The 
above  results  are  still  tree  if  this  is  sot  the  case,  bat  the  equations  are  messier. 
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*■  if  fj (A)  =  ajX,  the  linear  principal  component  function,  then 

E  ▼«</,(*/<*>>> =E-?V"  (»•(*)) 

i=i  1=1 

=  Var(A) 

since  a  has  length  1.  Here  we  have  written  X  for  the  function  A«(z)  =  o'*. 

•  if  the  fj  are  approximately  linear  •*"  the  £elta  method  to  obtain 

E  V»(/,(AyW))  «,  £(/*(  E (!/(«)))’  Vf(A,M) 

=  VarfA^*)) 

since  we  restrict  our  curves  to  be  unit  speed  and  thus  we  have  have  ||/*||  =  1. 

3.6.2.  The  power  method. 

We  already  mentioned  that  when  the  data  is  ellipsoidal  the  principal  curve  procedure  yields 
linear  principal  components.  We  now  show  that  if  our  smoother  fits  straight  lines,  then 
once  again  the  principal  curve  procedure  yields  linear  principal  components  irrespective  of 
the  starting  line. 


Theorem  3.1 


If  the  smoother  in  the  principal  curve  procedure  produces  least  squares  straight  line  fits, 
and  if  the  initial  functions  describe  a  straight  line,  then  the  procedure  converges  to  the  first 
principal  component. 

Proof 

Let  «(°)  be  any  starting  vector  which  has  unit  length  and  is  not  orthogonal  to  the  largest 
principal  component  of  X,  and  assume  X  is  centered.  We  find  a|°*  by  projecting  z,  onto 
«(0)  which  we  denote  collectively  by 

A<°>  =  Xa(°> 

where  A^  is  a  n  vector  with  elements  a|°^,  i  =  l,...,n.  We  find  by  regressing  or 
projecting  the  vector  »j  =  (*iy, . . .  ,x^)'  onto  A(°): 


a 


(i)  _ 

*  A<°^A^°) 
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.<■>  - 

A(°>A(°) 

XXaW 

aWX'XaM 

sad  a^1)  is  renormalized.  It  can  now  be  seen  that  iteration  of  this  procedure  is  equivalent 
to  finding  the  largest  eigenvector  of  XX  by  the  power  method  (Wilkinson  1965). 


I 


Chapter  4 


Theory  for  principal  curves  and  surfaces 

In  this  chapter  we  prove  the  results  referred  to  in  chapter  3.  Ia  most  cases  we  deal  only 
with  the  principal  curve  model,  and  suggest  the  analogues  for  the  principal  surface  model. 

4.1.  The  projection  index  is  measureable. 

Since  the  first  thing  we  do  is  condition  on  it  might  be  prudent  to  check  that  it  is 

indeed  a  random  variable.  To  this  end  we  need  to  show  that  the  function  A  j  :  Rp  <-*  H1 
is  measureable.  * 

Let  /(A)  be  a  unit  speed  parameterized  continuous  curve  in  p-space,  defined  for  A  € 
[Ao.Ai]  =  A.  Let 

D(.)=  inf  {<*(., /(A))}  V«€R' 

where 

rf(*,/(A))  =  IJ*-/(A)||> 

the  usual  euclidean  distance  between  two  vectors.  Now  set 

M(x)  =  {A;d(x,/(A))  =  /?(*)}. 

Since  A  is  compact,  M(z)  is  not  empty.  Since  /,  and  hence  d(x,  /(A))  is  continuous,  Me(z) 
is  open,  and  hence  M(x)  is  closed.  Finally,  for  each  z  in  R'  we  define  the  projection  index: 

A/(*)  =  supM(x)  .. 

A^(x)  is  attained  because  Af(x)  is  closed,  and  we  have  avoided  ambiguities. 

Theorem  4.1 

A^(x)  is  a  measureable  function  of  z. 

*  I  am  grateful  to  H.  Kunsch  of  ETH,  Zurich,  for  getting  me  etarted  on.  this  proof. 
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la  order  to  prove  that  A^(x)  is  measure&ble  we  need  to  show  that  for  any  c  €  A,  the  set 
(x  (A^(x)  <  c}  is  a  measure  able  set. 

Now  x  G  (x  | Ay(x)  <  e}  <=>  for  any  A  €  (e,Ai]  there  exists  a  A*  e  (Ao,c)  such 
that  d(x,/( A))  >  d(x,/( A')).  (i.e.  if  there  was  equality  then  by  our  convention  we  choose 
Ay(x)  =  A  >  e.)  In  symbols  we  have 

{* !*/(*)<«}=  n  u  {* irf(*,/(A)) > rf(*,/(A0)> 

A€M,1V6(Ao,c] 

The  first  step  in  the  proof  is  to  show  that 

b."  n  U  >•<(*./(*;))> 

=  Ac 

where  Q  is  the  set  of  rational  numbers.  Since  for  each  A 

u  {,M(«,/(A))>J(., /(*))>  a  u  <*w*. /«)>*(*.  /«))}. 

A'ejAo^l  A'6{A«,elnq 

it  follows  that  B,  C  A«.  We  need  to  show  that  3,  D  A «.  Suppose  z  €  A*  i.e.  for  any  given 
A  €  (c,  Aj]  3  A*  €  [Ao,c]  such  that 

d(x,/(A))  >  d(x,/(A')), 

For  any  given  such  A  and  A’  we  can  find  an  t  >  0  such  that 

d(s,/(A))  =  d(z,/(A'))  +  e 

Now  since  /  is  continuous  and  the  rationals  are  dense  in  Rl  we  can  find  a  A^  6  Q  such 
that  A,  <  A'  and  d(/(A'),  /(Aj))  <  <  .  (If  A'  €  Q  we  need  go  no  further).  This  implies  that 
<f(x,/( A))  >  d(x, /(A'))  by  the  Pythagorean  property  of  euclidean  distance.  This  in  turn 
implies  that  x  €  B,  and  thus  A«  C  B,,  and  therefore  A,  =  J3a. 
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The  second  step  is  to  show  that 

d.=  n  u  ><<<*./«)» 

A,«=MilnQ  i;e(A«^|nQ 

=  B. 

Now  clearly  B,  C  De.  Suppose  then  that  z  €  De%  i.e.  for  every  A^  €  (e,  Ai]  n  Q,  there 
is  a  A'  €  [Ao,cJ  n  Q  such  that  d(z,/(A«))  >  d(z,/(Aj)).  Once  again  by  continuity  of  /  and 
because  the  rationale  are  dense  in  Rl  we  can  find  another  A*  S  Q,  A*  >  A^  such  that 

d(z,/(A))  >  d(z,/(A;))> 

for  all  A  €  [A,,  AJ].  This  means  that 

*e  n  U  <»!<<(*,/«)>  <'(*./«))> 

<w  p 
- 

for  every  A,  €  (c,  Ai]  n  Q.  In  other  words 

*  €  f|  Ex ,.a; 

=  B. 

and  we  have  that  D,  =  B,.  Finally,  each  of  the  sets  in  Z7C  is  a  half  space,  and  thus 
measureable,  D,  is  a  countable  union  and  intersection  of  measurable  sets,  and  is  thus  itself 
measurable.  | 


4.2.  The  stationarity  property  of  principal  curves. 

We  first  prove  a  result  for  straight  lines.  This  will  lead  into  the  result  for  curves.  The 
straight  line  theorem  says  that  a  principal  component  line  is  a  critical  point  of  the  expected 
distance  from  the  points  to  itself.  The  converse  is  also  true. 

We  first  establish  some  more  notation.  Suppose  /(A)  :  A  •-*  $  is  a  unit  speed  con¬ 
tinuously  c'ifferentiable  parametrized  curve  in  TRr,  where  A  is  an  interval  in  Rl.  Let  g( A) 
be  defined  similarly,  without  the  unit  speed  restriction.  An  «  perturbed  version  of  /  is 
/«  =  /(A)  +  eg( A).  Suppose  X  has  a  continuous  density  ih  Rf  which  we  denote  by  h,  and 
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let  Di  (A,  ft)  be  defined  as  before  by 

D*(h,ft)=  E*||jf-A(A/€{X))j|S 
where  A^  (JC)  parametrizes  the  point  on  f,  closest  to  X. 

Definition 

The  curve  /  is  a  critical  point  of  the  distance  function  in  the  class  §  iff 

gfea  .ovic* 

*  r=0 

(We  have  to  show  that  this  derivative  exists.) 

Theorem  4.2 

Let  /(A)  =  EX  +  A*o  with  ||«o||  =  1,  and  suppose  we  restrict  j(A)  to  be  linear  as  well. . 
So  f(A)  a  A»,  ||*||  =  1  and  $  =  C,  the  class  of  all  unit  speed  straight  lines.  Then  /  is  a 
critical  point  of  the  distance  function  in  £  iff  *q  is  an  eigenvector  of  £  =  COV(X). 

Note: 

•  WLOG  we  assume  that  EX  =  0. 

•  ||vj|  =  1  is  simply  for  convenience. 

Proof 

The  closest  point  from  x  to  any  line  Aw  through  the  origin  is  found  by  projecting  s  onto 
w  and  has  parameter  value 


Tb,  i 


Upon  taking  expected  values  we  get 

Z?*(A,Aw)  =  trE-_.  (4.1) 

We  now  apply  the  above  to  /«  instead  of  w,  but  first  make  a  simplifying  assumption.  We 
can  assume  w.l.o.g  that  «o  =  «i  since  the  problem  is  invariant  to  rotations. 


< 
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We  split »  into  a  component  e*  =  e«i  along  <1  and  an  orthogonal  component  v*.  Thus 
*  =  «•«  +  **  where  c^e*  =  0.  So  /«  =  A((l  +  ce)e1  +  £»*).  We  now  plug  this  into  (4.1)  to 

get 

nJ /«.  ((1  +  ce)«i  +  €**)'E((l  +  ce)ei  +  ev*) 

-  a* 

(1  +  ce^Ee  1  +  2e(l  +  ceJejEe*  +  eV'Ee*  v  1 

(1 +  «)*  +  «* 

Differentiating  w.r.t.  t  and  setting  e  =  0  we  get 

*  .=0 

If  *1  is  a  principal  component  of  E  then  this  term  is  aero  for  all  «*  and  hence  for  all  o . 
Alternatively,  if  this  term,  and  hence  the  derivative,  is  zero  for  all  »  and  hence  all  v**ei  =  0, 

we  have 

=  0  V  9**e\  =  0 
=>E«i  =  e«j 

=>«i  is  an  eigenvector  of  E 


Note: 

Snppoee  *  is  in  fact  another  eigenvector  of  E,  with  eigenvalue  d,  then 

This  shows  that  /  might  be  a  maximum,  a  minimum  or  a  saddle  point. 

Theorem  4.3 

Let  Q  be  the  class  of  unit  speed  differentiable  curves  defined  on  A,  a  closed  interval  of  the 
form  [a,  6J.  The  curve  /  is  a  principal  curve  of  k  iff  /  is  a  critical  point  of  the  distance 
function  in  the  class  $. 

We  make  some  observations  before  we  prove  theorem  4.3.  Figure  4.1illustrates  the  situation. 
The  curve  /«  wiggles  about  /  and  approaches  /  as  <  approaches  0.  In  fact,  we  can  see  that 
the  curvature  of  f,  is  close  to  that  of  f  for  small  e.  The  curvature  of  /«  is  given  by 

ll/IWII’ 
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Figure  (4.1)  /«(A)  depicted  u  a  function  of  /(A). 


where  JV(A)  is  the  normal  rector  to  the  curve  at  A.  Thus  l/r^(A)  <  ||/"(A)|[  /  ||/'(A)j|2  since 
the  curve  is  not  unit  speed  and  so  the  acceleration  vector  is  slightly  off  normal.  Therefore 
we  have  rjt( A)  >  ||/#(A)  +  cf'(A)||1  /  ||/"(A)  +  ep*||  which  converges  to  ry(A)  as  e  -*■  0. 

The  theorem  is  stated  only  for  curves  /  defined  on  compact  sets.  This  is  not  such  & 
restriction  as  it  might  seem  at  first  glance.  The  notorious  space  filling  curves  are  excluded, 
but  they  are  of  little  interest  anyway.  If  the  density  h  has  infinite  support,  we  have  to  box  it 
in  R*  in  order  that  /,  defined  on  a  compact  set,  can  satisfy  either  statement  of  the  theorem. 
(We  show  this  later.)  In  practice  this  is  not  a  restriction. 

Proof  of  theorem  4.3. 

We  use  the  dominated  convergence  theorem  (Chung,  1974  pp  42)  to  show  that  we  can 
interchange  the  orders  of  integration  and  differentiation  in  the  expression 

'^(U)^**  ||Ar-/((AA(Af))||S.  (4.3) 

We  need  to  find  a  random  variable  Y  which  is  integrable  and  dominates  almost  surely  the 
absolute  value  of 

„  ~/«(A/,(X))||*  -  I*  -  /(A/W)[J 

=  - -  ■■  -  -  - - 
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for  all  €  >  0.  Notice  that  by  definition 

if  this  limit  exists.  Now 


Expanding  the  first  norm  we  get 

||X-  /.(A,^))!*  =  \\X  -  /(A/(X))||*  +  e2||tf(A/(X))||,-2e  (AT  -  /( Xf{X)))  -9(\f(X)), 
and  thus 

Z.  <  -2  (X  -  f{Xf(X)))  •  g(Af(X))  +  e  ||lf(A/(Af))||J 
<Yi 

where  Y\  is  some  bounded  random  variable. 

Similarly  we  have 

„  _  |^-/.(A/.(X))||,-||X-/(A,.(X))||’ 

Z.>— - j - 

We  expand  the  first  norm  again,  and  get 

Zt  >  -2  (X  —  f(Aft(X)fj  •  g(Aft(X))  +  e  J|jp(A^(JT))|| * 

* 

where  Yj  is  once  again  some  bounded  random  variable.  These  two  bounds  satisfy  the  con¬ 
ditions  of  the  dominated  convergence  theorem,  and  so  the  interchange  is  justified.  However, 
from  the  form  of  the  two  bounds,  and  because  /  and  g  are  continuous  functions,  we  see 
that  the  limit  lim«-^)  Zt  exists  whenever  Ajt{X)  is  continuous  in  <  at  t  =  0.  Moreover,  this 
limit  is  given  by 

**  “2  (x  —  f{Aj(X)fj  •  g(Aj(X)). 

We  show  in  lemma  4.3.1  that  this  continuity  condition  is  met  almost  surely. 


44  Section  4-2:  The  staticnarity  property  of  principal  curves 
We  denote  the  distribution  function  of  A f(X)  by  h\,  and  get 

^D\KU)  =  -2  Ekjk  (  E(X  \Xf(X)  =  A)  -  /(A))  •  g{X).  (4.4) 

If  /(A)  is  a  principal  curve  of  h,  then  E  (X  |  Xj  (X)  =  A)  =  /(A)  for  all  A  in  the  support 
of  Ax,  and  thus 

=0  V  differentiable  j. 

«=o 

Alternatively,  suppose  that 

E*,  (  E(X  -  /(A)  |A,(X)  =  A)  v,(A))  =  0  (4.5) 

for  all  differentiable  q.  In  particular  we  could  pick  g( X)  =  E(X  |  A  j (X)  =  A)  —  /(A).  Then 

||  E(JT  |  A/pT)  =  A)  -  /(A)|J  =  0 

and  consequently  /  is  a  principal  curve.  This  choice  of  g,  however,  might  not  be  differen¬ 
tiable,  so  some  approximation  is  needed. 

S:nce  (4.5)  holds  for  all  differentiable  g  we  can  use  different  g’s  to  knock  off  different 
pieces  of  E(X  |Ay(X)  =  A)  —  /(A).  In  fact  we  can  do  it  one  co-ordinate  at  a  time.  For 
example,  suppose  E(Xi  |Aj(X)  =  A)  is  positive  for  almost  every  A  6  (Aq,Ai).  We  suggest 
why  such  an  interval  will  always  exist.  We  will  show  that  Xj{x)  is  continuous  at  almost 
every  x.  The  set  {X  |  Xf(X)  =  A  €  (Ao,  Ai)}  is  the  set  of  X  which  exist  in  an  open  connected 
set  in  the  normal  plane  at  A,  and  these  normal  planes  vary  smoothly  as  we  move  along  the 
curve.  Since  the  density  of  X\  is  smooth,  it  does  not  change  much  as  we  move  from  one 
normal  plane  to  the  next,  and  thus  its  expectation  does  not  change  much  either.  We  then 
pick  a  differentiable  g\  so  that  it  is  also  positive  in  that  interval,  and  xero  elsewhere,  and 
set  gi  3  •••  =  gp  =  0.  We  apply  the  theorem  and  get  E(Xj  |Ay(X)  =  A)  =  /i(A)  for 
A  €  (Ao,Ai).  We  can  do  this  for  all  such  intervals,  and  for  each  co-ordinate,  and  thus  the 
result  is  true.  I 

Corollary 

If  a  principal  curve  is  a  straight  line,  then  it  is  a  principal  component. 
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Proof 


If  /  is  a  principal  curve,  then  theorem  4.3  is  true  for  all  g,  in  particular  for  g(X)  =  Xv.  We 
then  invoke  theorem  4.2.  I 

In  order  to  complete  the  proof,  we  need  to  prove  the  following 
Lemma  4.3.1 

The  projection  function  A ^  (x)  is  continuous  at  e  =  0  for  almost  every  z  in  the  support  of 


Proof 

Let  us  consider  first  where  it  will  not  be  continuous.  Suppose  there  are  two  points  on  / 
equidistant  from  x,  and  no  other  points  on  /  are  as  close  to  z.  Thus  3  Ao  >  Ai  ,  Ay(z)  =  Ao 
and  ||z  -  /(Ao)||  =  ||z  -  /(Ai)||.  It  is  easy  to  pick  g  in  this  situation  such  that  Xjt[x)  is  not 
continuous  at  e  —  0.  We  call  such  points  ambiguous.  However,  we  prove  in  lemma  4.3.2 
that  the  set,  of  all  ambiguity  points  for  a  finite  length  differentiable  curve  has  measure  zero. 
We  thus  exclude  them. 

Suppose  ui  >  0  is  given,  and  there  is  no  point  on  the  curve  as  close  to  z  as  /( Xj(z))  — 
/(Ao).  Thus  ||z  -  /(Ao)||  <  ||z  -  /(Ai)jj  V  Ax  €  [a,  4)  n  (Ao  -  w,Ao  +  w)e.  (Notice  that  at 
the  boundaries  the  w  interval  can  be  suitably  redefined.)  Since  this  interval  is  compact, 
and  the  distance  functions  are  differentiable,  we  can  find  a  6  >  0  such  that  ||z  —  /(A0)||  < 
||*-/(Ax)||  -  S.  Let  M  -  supx€[a>|  ||ff(A)||  and  e0  =  6/(2M).  Then  ||z-  /€(A0)||  < 
||z—  /«(Ai)||  V  Ax  €  [a, 4]  O  (Ao  —  «,Ao  +  w)**  khd  Ve  <  £o-  This  implies  that  A jt{x)  6 
(Ao  —  «,  Ao  +  «),  and  the  continuity  is  established.  | 


lemma  4.3.2 

The  set  of  ambiguity  points  has  probability  measure  zero. 

Proof 

We  prove  the  lemma  for  a  curve  in  2-space,  but  the  proof  generalizes  to  higher  dimensions. 
Referring  to  figure  4.2,  suppose  a  is  an  ambiguity  point  for  the  curve  /  at  A.  We  draw  the 
circle  with  center  a  and  tangent  to  /  at  A.  This  means  that  /  must  be  tangent  to  the  circle 
somewhere  else,  say  at  /(A1).  If  h  on  the  normal  at  /(A)  is  also  an  ambiguity  point,  we  can 
draw  a  similar  circle  for  it.  But  this  contradicts  the  fact  that  /(A)  is  the  closest  point  to  a, 
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Figure  4.2  There  are  at  moat  two  ambiguity  points  on  the  normal 
to  the  curve;  one  on  either  side  of  the  curve. 

si  ice  the  circle  for  h  lies  entirely  inside  the  circle  for  a,  and  by  the  ambiguity  of  b  we  know 
the  curve  must  touch  this  inner  circle  somewhere  other  than  at  /(A). 

Let  I(X)  be  an  indicator  function  for  the  set  of  ambiguity  points.  Since  ther  are  at 
most  two  at  each  A,  we  have  that  Ef I(X)  |  \j(X)  =  A)  =  0.  But  this  also  implies  that  the 
unconditional  expectation  is  zero.  I 

Corollary 

The  projection  index  Aj(z)  is  continuous  at  almost  every  z. 

Proof 

We  show  that  if  Aj(z)  is  not  continuous  at  z,  then  z  is  ra  ambiguity  point.  But  this  set 
has  measure  zero  by  lemma  4,3.2. 

If  A f(x)  is  not  continuous  at  z,  there  exists  a  <o  *  0  such  that  for  every  6  >  0  3  *# 
such  that  j|z  —  z$\\  <  S  but  |a^(z)  -  A^(z«)j  >  jq.  Letting  £  30  to  zero,  we  see  that  z  must 
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Figure  4.3  The  act  of  point*  to  the  right  of  /(a)  that  project  there 

has  maaawe  aero. 


be  equidistant  to  Xf(x)  and  at  least  one  other  point  on  the  curve  with  projection  index  at 
least  <o  from Xj (x).  | 

Theorem  4.3  proves  the  equivalence  of  two  statements:  /  is  a  principal  curve  and  / 
is  a  critical  value  of  the  distance  function.  We  needed  to  assume  that  /  is  defined  on  a 
compact  set  A.  This  means  that  the  curve  has  two  ends,  and  any  data  beyond  the  ends 
might  well  project  at  the  endpoints.  This  leaves  some  doubt  as  to  wether  the  endpoint  can 
be  the  average  of  these  points.  The  next  lemma  shows  that  for  either  statement  of  the 
theorem  to  be  true,  some  truncation  of  the  support  of  h  might  be  necessary  'if  the  support 
is  unbounded). 


Lemma  4.3.3 
If  /  is  a  principal  curve,  then  (x  -  f(Xj(x))J 
A.  if  <£!£& 


««0 


/'(Ay(x))  =  0  as.  for  x  in  the  support  of 

=  0  V  differentiable  f ,  the  l  the  same  u  true.  By  /'(a)  we  mean  the 
derivative  from  the  right,  and  similarly  from  the  left  for  /'(&). 

Proof 

If  Xj(x)  €  (a, 6)  the  proof  is  immediate.  Suplpoee  then  that  Xjfx)  -  a.  Rotate  the  co¬ 
ordinates  so  that  f'{a)  =  «j.  No  points  to  the  left  of  /(a)  project  there.  Suppose  /  is  a 
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principal  curve.  This  then  implies  that  the  act  of  points  that  are  to  the  right  of  f(a)  and 
project  at  /(a)  has  conditional  measure  zero,  else  the  conditional  expectation  would  be  to 
the  right.  Thus  they  also  have  unconditional  measure  zero. 

Alternatively,  suppose  that  there  is  a  set  of  a  of  positive  measure  to  the  right  of  f(a) 

I 

that  projects  there.  We  can  construct  3  such  that  3 (a)  =  /*(a),  and  zero  everywhere  else. 
For  such  a  choice  of  3  it  is  clear  that  the  derivative  cannot  be  zero.  However,  this  choice  of 
3  is  not  continuous.  Bat  we  can  construct  a  version  of  3  that  is  differentiable  and  does  the 


1  job  as  y.  We  have  then  reached  a  contradiction  to  the  claim  that 
differentiable  3.  | 


=  0  v 


«*0 


4.3.  Some  results  on  the  subclass  of  smooth  principal  curves. 


We  have  defined  a  subset  T,(h)  of  principal  curves.  These  are  principal  curves  for  which 
Ay  (a)  is  a  continuous  function  at  each  a  in  the  support  of  h.  In  the  previous  section  we 
showed  that  if  Ay  (a)  is  not  continuous  at  a,  then  a  is  an  ambiguity  point.  We  now  prove  the 
converse:  no  points  of  continuity  are  ambiguity  points.  This  will  prove  that  the  continuity 
constraint  indeed  avoids  ambiguities  in  projection. 

In  figure  4.4a  the  curve  is  smooth  but  it  wraps  around  so  that  points  dose  together 
might  project  to  completely  different  parts  of  the  curve.  This  reflects  a  global  property  of 
the  curve  and  presents  an  ambiguity  that  is  unsatisfactory  in  a  summary  of  a  distribution. 

Theorem  4.4 

If  Ay  (a)  is  continuous  at  a,  then  a  is  not  an  ambiguity  point. 

Proof 

We  prove  by  contradiction.  Suppose  ws  have  an  a,  and  Af  #  A*  such  that 

II*  -  /(Ai)||  =  Us  - 
=  <<(*./) 

It  is  easy  to  see  that  if  Aj  yields  the  eloeest  point  on  the  curve  for  a,  then  At  is  the  position 
that  yields  the  minimum  for  all  a.t  =  ai/(A|)  +  (1  -  e»t)a  for  a  €  (0, 1).  Similarly  for  A*. 
Now  the  idea  is  to  let  at  and  aj  get  arbitrarily  small,  and  thus  ||a«^  -  s«,||  gets  small,  but 
^/(*«t)  “  */(*•»)  38  constant  and  this  violates  the  continuity  of  A y(*)  I 

Figure  4.4b  represents  the  other  ambiguous  situation,  this  time  caused  by  a  local 
property  of  the  curve.  We  consider  only  points  inside  the  curve.  If  such  points  can  occur  at 
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,  Figure  4.4  T1m  caatiaeity  constraint  avoids  global  ambiguities  (a) 
and  local  awhigniti—  (b)  ia  projection. 

the  cantor  of  cvmtui*,  than  there  ia  bo  uaiqua  point  of  projection  on  the  curve.  By  intide 
we  mean  that  the  inner  product  (a  -  /(A /(*)))  •  (ef  (*/(*))  -  /( A /(*)))  ia  non-negative, 
where  «f( A)  »  Qua  canter  of  curvature  of  /  at  the  point  /(A). 

Theorem  4.5 

If  A y  (*)  ia  cootinuooa  at  a,  than  •  ia  not  at  the  center  of  curvature  of  /  at  A. 

Proof 

The  idea  of  the  proof  ia  illustrated  in  figure  4.4b.  If  a  point  at  ej(A)  projects  at  A,  then  it 
will  project  at  many  other  points  immediately  around  A,  since  locally  /(A)  behaves  like  the 
arc  of  a  circle  with  center  tj( A).  This  would  contradict  the  continuity  of  A j.  Furthermore, 
if  a  point  at  a  beyond  «j( A)  projects  at  A,  we  would  expect  that  points  on  either  side  of  a 
would  project  to  different  parts  of  the  curve,  and  this  would  also  contradict  the  continuity 
of  A;. 
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We  now  make  these  ideas  precise.  Assume  z  projects  at  Xj{z)  =  Ao,  where 

' = /(Ao) + ]i7^ra(ii7?CT + 6) 

and  6  >  0.  Thus  z  is  00  or  beyond  the  center  of  curvature  of  /  at  Ao.  Let  9(A)  ||/(A)  —  x||. 

By  hypothesis  f(A)  >  f(Ao)  with  equality  holding  iff  A  =  Ao.  (Otherwise  there  would  be  at 
least  two  points  on  the  curve  the  same  distance  from  z  and  this  would  violate  the  continuity 
of  Xj).  This  implies  that 

(1)  ✓(*>)  =  0 

(2)  f*(Ao)  >  0  for  a  strict  minimum  to  be  achieved. 

We  evaluate  these  two  conditions; 

rt*) -/(*>•  wo.)-*) 

= *  'iin^)ii(!i7%Ml+ 6) 


f-(Ao)  «  /-(Ao)  •  (/(Ao)  -  z)  +  /*(Ao)  •  /*(Ao) 

=  /*(A«) - - * - +  51  +  1 

ii/-(Ao)ini/-(Ao)ir4,+l 


=  -lir(Ao)i|5 


which  contradicts  (2)  above. 


4.4.  Some  results  on  bias. 

The  principal  curve  procedure  is  inherently  biased.  There  are  two  forms  of  bias  that  can 
occur  concurrently.  We  identify  them  as  model  bioo  and  ettimation  bio*. 

Model  bias  occurs  in  the  framework  of  a  functional  model,  where  the  data  is  generated 
from  a  model  of  the  form  s  =  /(A)  ■+■  «,  and  we  wish  to  recover  /(A).  In  general,  starting 
at  /(A),  the  principal  curve  procedure  will  not  have  /(A)  as  its  solution  curve,  but  rather 
a  biased  version  thereof.  This  bias  goes  to  xero  with  the  ratio  of  the  noise  variance  tothe 
radius  of  curvature. 

Estimation  bias  occurs  because  we  use  scatterplot  smoothers  to  estimate  conditional 
expectations.  The  bias  is  introduced  because  we  average  over  neighborhoods,  and  this 
usually  has  a  flattening  effect. 
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Figure  4.5  Th«  data  is  generated  from  the  arc  of  a  circle  with 
radina  p  and  with  iid  M  (0,0*1)  error*.  The  location  on  the  circle  ie 
selected  uniformly. 

4.4.1.  A  simple  model  for  investigating  bias. 

The  scenario  we  shall  consider  is  the  arc  of  a  circle  in  2-space.  This  can  be  parametrized 
by  a  unit  speed  curve  /(A)  with  constant  curvature  1/p,  where  p  is  the  radius  of  the  circle: 


for  A  €  (~A/,A/]  C  [~wp,wp\.  For  the  remainder  of  this  section  we  will  denote  intervals  of 
the  type  (-As.AsJ  by  A*. 

The  points  a  are  generated  as  follows:  First  a  A  is  selected  uniformly  from  A /.  Given 
this  value  of  A  we  pick  the  point  s  from  some  smooth  symmetric  distrioution  with  first  two 
moments  (/(A),  e*I)  where  a  has  yet  to  be  specified.  Intuitively  it  seems  that  more  mass 
gets  put  outside  the  circle  than  inside,  and  so  the  circle,  or  arc  thereof,  that  gets  closest 
to  the  data  has  radius  larger  than  p.  Consider  the  points  that  project  onto  a  small  arc  of 
the  circle  (see  figure  4.5).  They  lie  in  a  segment  which  fans  out  from  the  origin.  As  we 
shrink  this  are  down  to  a  point,  the  segment  shrinks  down  to  the  normal  to  the  curve  at 
that  point,  but  there  is  always  more  mass  outside  the  circle  than  inside.  So  when  we  take 
conditional  expectations,  the  mean  lies  outside  the  circle. 

One  would  hope  that  the  principal  curve  procedure,  operating  in  distribution  space 
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and  starting  at  the  true  curve,  would  converge  to  this  minimizing  distance  circle  in  this 
idealized  situation.  It  turns  out  that  this  is  indeed  the  case. 

Figure  4.5  depicts  the  situation.  We  have  in  mind  situations  where  the  ratio  a/p  is 
small  enough  to  guarantee  that  P(|e|  >  p)  sa  0.  This  effectively  keeps  the  points  local; 
they  will  not  project  to  a  region  on  the  circle  too  far  from  where  they  were  venerated. 

Theorem  4.6 

Let  /(A),  A  €  Ay  be  the  arc  of  a  circle  as  described  above.  The  parameter  A  is  distributed 
uniformly  in  the  arc,  and  given  A,  *=/( A)  +  «  where  the  components  of  e  are  iid  with  mean 
0  variance  a*.  We  concentrate  on  a  smaller  arc  A #  inside  Ay,  and  assume  that  the  ratio 
a/p  is  small  enough  to  guarantee  that  all  the  points  that  project  into  A*  actually  originated 
from  somewhere  within  Ay. 


®(*  |A/(*)  €  As) 


■(:) 


where 


A  tip  —  9/2  and 


...  «»in {9/2) 
9/2  * 


r*  =  lim  r» 

=  Vy/b>  +  a)*  +  el 

Finally  r*  *  p  as  a/p  —*  0. 

Lemma  4.6.1 

Suppose  A i  —  np.  (We  have  a  full  circle.)  The  radius  of  the  circle,  with  the  same  center 
as  /(A),  that  minimizes  the  expected  squared  distance  to  the  points  is* 


Also  r*  — *  p  as  a/p  — ♦  0. 


*  I  thank  Art  Owen  for  suggesting  this  result. 
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Proof  of  lemma  4.6.1 

The  situation  is  depicted  in  Figure  4.5.  For  a  given  point  x  the  squared  distance  from  a 
circle  with  radius  r  is  the  radial  distance  and  is  given  by 


The  expected  drop  in  the  squared  distance  using  a  circle  with  radius  r  instead  of  p  is  given 

by 

EA D2(x,r,p)  =  E  d*(x,p)  -  Ed*(x,r) 

=  E(||.||-p)’-  E(||.||-r)*  . 

We  now  condition  on  A  =  0  and  expand  (4.8)  to  get 

XAD*(s,r,p  | A  =  0)  =  p1  —  r*  +  2(r  -  p)  E yj(p  +  ei)»  +  e§ 

Differentiating  w  j.t.  r  we  see  that  a  mavirmim  is  achieved  for 

r  =  r* 


*  +  *i 


7.  =  p  E  V(1  +  ei/p)3  +  (e»/p)2 

>  p  E  |1  +  ej/pj 

>  p |  E(1  +  ei/p)|  (Jensen) 

=  p 

with  strict  inequality  iff  Var(e,/p)  =  e2/p*  =  0.  Note  that 


EAJ?2(x,  r*,p)  =  (p  —  E^p  +  nJ’  +  ef)2  (4.9) 

which  is  non-negative. 

When  we  condition  on  some  other  value  of  A,  we  can  rotate  the  system  around  so  that 
A  o  0  since  the  distance  is  invariant  to  such  rotations,  and  thus  for  each  value  of  A  the  same 
r*  maximises  EAD2(x, r,p  j A),  and  thus  maximizes  EA Di(x,r,p).  I 

Note:  We  can  write  the  expression  for  r*  as 

r*  =  pE^(l  +  <l)2  +  e§ 


(4.10) 


54  Section  4>4:  Some  result*  on  bias 


where  t,  =  a/?,  c,-  ~  (0,5),  ard  6  =  off.  Expanding  the  square  root  expression  using  the 
Taylor’s  expansion  we  get 

r*w,  +  <r3/(2  p).  (4.11) 

This  yields  an  expected  squared  distance  of 

Ed*(JC,r*)  »  <r3  -  n4/(4/>3) 

which  is  smaller  than  the  usual  n3.  This  expression  was  also  obtained  by  Efron  (1984). 

Proof  of  theorem  4.6. 

We  will  show  that  in  a  segment  of  sue  ^  the  expected  distance  from  the  points  in  the 
segment  to  their  mean  converges  to  the  expected  radial  distance  as  ^  — ►  0.  If  we  consider 
all  such  segments  of  size  the  conditional  expectations  will  lie  oil  the  circumference  of 
a  circle.  By  definition  the  conditional  expectations  minimize  the  squared  distances  to  the 
points  in  their  segments,  and  hence  in  the  limit  the  radial  distance  in  each  segment.  But 
so  did  r*,  and  the  results  follow. 

Suppose  that  4  is  chosen  so  that  2ar/4  is  a  positive  integer.  We  divide  the  circle  up 
into  segments  each  with  arc  angle  Consider  E(x  |  A j(x)  £  A^),  where  and  A^  are 
defined  above. 

Figure  4.6depicts  the  situation.  The  points  are  symmetrical  about  the  zi-axis,  so  the 
expectation  will  be  of  the  form  (r,  0)'.  By  the  rotational  invariance  of  the  problem,  if  we 
find  these  conditional  expectations  for  each  of  the  segments  in  the  circle,  we  end  up  with  a 
circle  of  points,  spaced  degrees  apart  with  radius  r. 

We 
of  points 
circle  wit 

*((*- 

Also,  we 

*((* 


irst  show  that  as  ^  *  0,  r  —»  r».  In  order  to  do  this,  let  us  compare  the  distance 
from  their  mean  vector  r  =  (r,  0)'  in  the  segment,  to  their  radial  distance  from  the 
h  radius  r.  If  we  let  r(x)  denote  the  radial  projection  of  z  onto  the  circle,  we  have 


E(*  lA^*)  €  A,)3  JA/f*)  €  A,j  =  E[(z  -  r)S  |A/(»)  €  A,j 

>  E((*-r(*))3lAf(*)€A*| 


(4.12) 


0*  !*/(*) €  A*i 


=  E[(*  -  r(x))3  |A/(*)  6  A*)  +  E[(r(x)  -  r)3  |A/(z)  e  A*J 
-  2  E  (|r(z)  -  r|  |*  -  r(z)|  cos(^(x))  |  A/(x)  €  A,) 
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Figure  4.8  The  conditional  at  x,  pna  Ay(i)  g  A*. 

whan  4>{x)  an  tha  angles  as  depicted  in  figure  4.6.  The  second  term  on  the  right  of  (4.13) 
is  smaller  than  (r^/2)*.  We  treat  separately  the  case  when  x  is  inside  the  circle,  and  when 
s  is  outside. 

•  When  9  is  inside  the  circle,  4>(z)  is  acute  and  hence  coe(^>(z))  >  0.  Thus 

E[(t  -  r)*  |  *;{*)€  A,] 

<  E((*  —  r(*))*  | Aj(x)  €  A^]  +  0(4) 

•  When  9  is  outside  the  circle,  4(*)  is  obtuse  and  coe(^(x))  <  0.  Since  —  co a(rj>(x))  = 
«u>(0(*)  -  »/2)  and  from  the  figure  4(x)  -  w/2  <  «^/4,  we  have  that  -  coa(V>(x))  < 
*“(^/4)  =  0(4).  Now  E[(|v(x)  -  r|  •  |*  -  r(*)|)  |Ay(x)  €  A*|  is  bounded  since  the 
errors  an  assumed  to  have  finite  second  moments.  Thus  (4.14)  once  again  holds. 

So  from  (4.12)  and  (4.14)  ,  as  4  — *  0,  the  expected  squared  radial  distance  in  the  segment 
and  the  expected  squared  distance  to  the  mean  vector  converge  to  the  same  limit.  Suppose 

E(x|A/(*)  =  0)  =  r“ 
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Since  the  conditional  expectation  r**  minimizes  the  expected  squared  distance  in  the  seg¬ 
ment,  this  tells  us  that  a  circle  with  radius  r**  minimizes  the  radial  distance  in  the  segment. 
Since,  by  rotational  symmetry,  this  is  true  for  each  such  segment,  we  have  that  r**  minimizes 

I,E(||»||-r)1|i/(,)  =  «=  E(M|-r)>. 

This  then  implies  that  r**  =  r*  by  lemma  4.6.1  and  thus 

Urn  *(*  |A/(x)  €  A^)  =  B(*  |A/(*)  =  0) 

=  r* 

This  is  the  conditional  expectation  of  points  that  project  to  a  an  arc  of  size  0  or  simply  a 
point.  In  order  to  get  the  conditional  expectation  of  points  that  project  onto  an  arc  of  size 
#,  we  simply  integrate  over  the  arc: 


Corollary 

The  above  results  generalize  exactly  for  the  situation  where  data  is  generated  from  a  sphere 
in  R*.  The  sphere  that  gets  closest  to  the  data  has  radius 

r*  =  E^/(p  +  ei)J  +  e|  +  e| 

and  this  is  exactly  the  conditional  expectation  of  x\  for  points  whose  projection  is  at  (/>,  0, 0)'. 

\ 
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Corollary 

If  the  data  is  generated  from  the  circumference  of  a  circle  as  above,  the  principal  curve 
procedure  converges  after  one  iteration  if  we  start  at  the  model.  This  is  also  true  for  the 
principal  surface  procedure  if  the  data  is  generated  from  the  surface  of  a  sphere. 

Proof 

After  one  iteration,  we  have  a  circle  with  radius  r*.  All  the  points  project  at  exactly  the 
same  position,  and  so  the  conditional  expectations  are  the  same.  This  is  also  true  for  the 
principal  surface  procedure  on  the  sphere.  | 

4.4.2.  From  the  circle  to  the  helix. 

The  circle  gives  us  insight  into  the  behaviour  of  the  principal  curve  procedure,  since  we 
can  imagine  any  smooth  curve  as  being  made  up  of  many  ares  of  circles.  Equation  (4.15) 
clearly  separates  and  demonstrates  the  two  forms  of  bias: 

•  Model  bias  since  r*  >  p. 

•  Estimation  bias  since  the  co-ordinate  functions  are  shrunk  by  a  factor  sin(0/2)/(0/2) 
when  we  average  within  arcs  or  spans  of  size  9. 

For  a  sufficiently  large  span,  the  estimation  bias  will  dominate.  Suppose  that  in  the  present 
••tup,  »  =  p/4.  Then  from  (4.11)  we  have  that  r*  =  1.031p.  From  (4.7)  we  see  that 
a  smoother  with  span  corresponding  to  0.27*  or  14%  of  the  observations  will  cancel  this 
effect.  This  is.  considered  a  small  span  for  moderate  sample  sizes.  Usually  the  estimation 
bias  will  tend  to  flatten  out  curvature.  This  is  not  always  the  case,  as  the  circle  example 
demonstrates.  In  this  special  setup,  the  center  of  curvature  remains  fixed  and  the  result  of 
flattening  the  co-ordinate  functions  is  to  reduce  the  radius  of  the  circle.  The  central  idea  is 
■till  dear:  model  bias  is  in  a  direction  away  from  the  center  of  curvature,  and  estimation 
bias  towards  the  center. 

We  can  consider  a  circle  to  be  a  flattened  helix.  We  show  that  as  we  unflatten  the  helix, 
the  effect  of  estimation  bias  changes  from  reducing  the  radius  of  curvature  to  increasing  it. 

To  fix  ideas  we  consider  again  the  circle  in  R*.  As  we  have  observed  the  result  of 
estimation  and  model  bias  is  to  reduce  the  expected  radius  from  1  to  r  (for  a  non-zero  span 
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smoother  such  that  r  <  1).  Thus  we  have 


with 


/j«||  ■ 


r.  The  rep^rameterized  curve  is  given  by 


/  rcos(A/r) 
yrsin(A/r) 


) 


9 


and  by  definition  the  radius  of  curvature  is  r  <  1.  Here  the  center  of  curvature  remains  the 
same,  but  this  is  not  usually  the  case. 

A  unit  speed  helix  in  1RS  can  be  represented  by 


/coa(A /c)\ 
/(A)  =  sin(A/c) 

V  bX/c  ) 


where  c*  =  1  +  6*.  It  is  easy  to  check  that  rj  —  1  +  6*,  so  even  though  the  helix  looks  like  a 
circle  with  radius  1  when  we  look  down  the  center,  it  has  a  radius  of  curvature  larger  than 
1.  This  is  because  the  osculating  plane,  or  plane  spanned  by  the  normal  vector  and  the 
velocity  vector,  makes  an  angle  with  the  —  xj  plane.  In  the  case  of  a  circle,  the  effect  of 
the  smoothing  was  to  shrink  the  co-ordinates  by  a  factor  r.  For  a  certain  span  smoother, 
the  helix  co-ordinates  will  become  (r cos(A/c),  r  sin(A/c),  bX/c)1.  Notice  that  straight  lines 
are  preserved  by  the  smoother.  Thus  the  new  unit  speed  curve  is  given  by 


(rcoe(A/e*J  \ 
rsin(A/c*)  J  , 
bX/c*  J 


where  e*  =  r*  +  b7.  The  radius  of  curvature  is  now  (ra  +  5J)/r.  If  we  look  at  the  difference 
in  the  radii  we  get 


r 2  +  b*  .  lS 

rJ~rf  —  — - — —1  +  6* 

_  (l-f)y-r) 
r 

>0  if  6*  >  r 

This  satisfies  our  intuition.  For  small  6  the  helix  is  almost  like  a  circle  and  so  we  expect 
circular  behaviour.  When  6  gets  large,  the  helix  is  stretched  out  and  the  smoothed  version 
has  a  larger  radius  of  curvature. 
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4.4.5.  One  more  bias  demonstration. 

We  conclude  this  section  with  one  further  example.  So  far  we  have  discussed  bias  in  a  rather 
oversimplified  situation  of  constant  curvature. 


Figure  4.7  The  thick  curve  is  the  the  principal  curve  using  conditional  expectations  at  the 
model,  and  shows  the  model  bias.  The  two  dashed  curves  show  the  compounded  effect  of  model 
and  estimation  bias  at  spans  of  30%  and  40%. 

A  sine  wave  in  R1  does  not  have  constant  curvature.  In  parametric  form  we  have 


A  simple  calculation  shows  that  the  radius  of  curvature  rj(\)  is  given  by 

1  _  sin(Ax) 

r/(A)  “  (1  +  cos2(Ax))*/*  * 

and  achieves  a  minimum  radius  of  1  unit.  The  model  for  the  data  is  X  —  /(A)  +  «  where 
A  —  £f[0,2]  and  e  ~  M (0,1/4)  independent  of  A.  Figure  4.7shows  the  true  model  (solid 
curve),  and  the  points  are  a  sample  from  the  model,  included  to  give  an  idea  of  the  error 
structure.  The  thick  curve  is  B(AT  |  A^(Af)  =  A).  Here  is  a  situation  where  the  model 
bias  results  in  a  curve  with  more  curvature,  namely  a  minimum  radius  of  0.88  units.  This 


60  Section  4.5:  Principal  curves  of  elliptical  distributions 

curve  was  found  by  simulation,  and  is  well  approximated  by  1/0.88  sin(Ax).  There  are  two 
dashed  curves  in  the  figure.  They  represent  E (AT  |  A  j(X)  €  A,(A)),  where  A,(A)  represents 
a  symmetric  interval  of  length  sA  about  A  (Boundary  effects  were  eliminated  by  cyclically 
extending  the  range  of  A.)  We  see  that  at  s  —  30%  the  estimation  bias  approximately 
cancels  out  the  model  bias,  whereas  at  s  =  40%  there  is  a  residual  estimation  bias. 

4.5.  Prircipal  curves  of  elliptical  distributions. 

We  have  seen  that  for  elliptical  distributions  the  principal  components  are  principal  curves. 
Are  there  any  more  principal  curves?  We  first  of  all  consider  the  uniform  disc  with  no  holes. 
For  this  distribution  we  propose  the  following: 


Figure  (4.8)  The  only  principal  curves  in  7e(k)  of  a  uniform  disk 
are  the  principal  components. 

Proposition 

The  only  principal  curves  in  /e(A)  are  straight  lines  through  the  center  of  the  disk. 

An  informal  proof  of  this  claim  is  as  follows; 

•  Any  principal  cur/e  must  enter  the  disk  once  and  leave  it  once.  This  must  be  true 
since  if  it  were  to  remain  inside  it  would  have  to  circle  afound.  But  this  would  violate 
the  continuity  constraint  imposed  by  Te{h)  since  there  would  have  to  exist  points  at 
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the  centers  of  curvature  of  the  curve  at  some  places.  Furthermore,  it  cannot  end  inside 
the  disk  for  reasons  similar  to  those  used  in  lemma  4.3.3. 

•  The  curve  enters  end  leaves  the  disk  normal  to  the  circumference.  For  symmetry 
reasons  this  must  be  true.  As  it  enters  the  disk  there  must  be  equal  mass  on  both 
sides. 

•  The  curve  never  bends  (see  figure  4.8).  At  the  first  point  of  curvature,  the  normal 
to  the  curve  will  be  longer  on  one  side  than  the  other.  The  set  of  points  that  project 
at  this  spot  will  not  be  conditionally  uniformly  distributed  along  the  normal.  This 
is  because  the  set  is  the  limit  of  a  sequence  of  segments  with  center  at  the  center  of 
curvature  of  the  curve  at  the  point  in  question.  Also,  all  points  in  the  segment  will 
project  onto  the  arc  that  generates  the  segment;  if  not  the  continuity  constraint  would 
be  violated.  So  in  addition  to  the  normal  being  longer,  it  will  have  more  man*  on  the 
long  side  as  well.  This  contradicts  the  fact  that  the  mean  lies  on  the  curve. 

Thus  the  only  curves  allowed  are  straight  lines,  and  they  will  then  have  to  pass  through  the 
center  of  the  disk. 

Suppose  now  that  we  have  a  convex  combination  of  two  disks  of  different  radii  but  the 
same  centers.  A  similar  argument  can  be  used  to  show  that  once  again  the  only  principal 
curves  are  the  lines  through  the  center.  This  then  generalizes  to  any  mixture  of  uniform 
disks  and  hence  to  any  spherically  symmetric  distribution  of  this  form. 

We  conjecture  that  for  ellipsoidal  distributions  the  only  principal  curves  are  the  prin¬ 
cipal  components. 
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Algorithmic  details 


In  this  chapter  we  describe  in  more  detail  the  various  constituents  of  the  principal  curve 
and  surface  algorithms. 

5.1.  Estimation  of  curves  and  surfaces. 

We  described  a  simple  smooth  or  local  averaging  procedure  in  chapter  4.  There  it  was 
convenient  to  describe  the  smoother  ss  a  method  of  averaging  in  p  space,  although  it  has 
been  pointed  out  that  we  can  do  the  smoothing  co-ordinate  wise.  That  simplifies  the 
treatment  here,  since  we  only  need  to  discuss  smoothers  in  their  more  ususi  regression 
context. 

Usually  a  ecatterplot  smoother  is  regarded  as  an  estimate  of  the  conditional  expectation 
X(y  |  X),  where  Y  and  X  are  random  variables.  For  our  purports  X  may  be  one  or  two 
dimensional.  We  will  discuss  one  dimensional  smoothers  first,  since  they  are  easier  to 
implement  than  two  dimensional  smoothers. 

5.1.1.  On«  dimensional  smoother*. 

The  following  subnet  of  wnootbers  evolved  naturally  as  estimates  of  conditional  expectation, 
and  are  listed  in  order  of  complexity  and  computational  coet. 

5. 1.1.1  Moving  average  smoothers. 

The  simplest  and  most  natural  estimate  of  E(Y  |  X)  is  tbs  roaring  average  smoother. 
Given  a  sample  (yu,  *<),  i  =  1, . . . ,  n,  with  the  *i  in  ascending  order,  w»  define 

Smoothly  | *i)=  — |-t  Vi  (5.1) 

where  k  =*  [(ns  -  l)/2j  and  e  €  (0, 1}  is  called  the  span  of  the  smoother.  An  estimate  of 
the  conditional  expectation  at  x*  is  the  average  of  the  yy  for  all  those  observations  with  z 
value  equal  to  *i.  Sines  we  usually  only  have  one  such  observation,  we  average  the  y,  for 
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all  those  observations  with  x  value  close  to  Zj.  In  the  definition  above,  close  is  defined  in 
the  ordinal  scale  or  in  ranks.  We  can  also  use  the  interval  scale  or  simply  distance,  but  this 
is  computationally  more  expensive.  This  moving  average  smoother  suffers  from  a  number 
at  drawbacks,  it  does  not  produce  very  smooth  fits  and  does  not  even  reproduce  straight 
lines  unless  the  x,  are  squ  is  paced.  It  also  suffers  from  bias  effects  on  the  boundaries. 

5.1.1.3  Local  linear  smoothers. 

An  improvement  on  the  moving  average  smoother  is  the  local  linear  smoother  of  Friedman 
and  Stuetsle  (1981).  Here  the  smoother  estimates  the  conditional  expectation  at  x*  by 
the  fitted  value  from  the  least  squares  line  fit  of  y  on  i.  using  only  those  points  for  which 
Zj  €  (xi-a ,Xi+s).  This  suffers  lass  from  boundary  bias  than  the  moving  average  and  always 
reproduces  straight  lines  exactly.  The  coet  of  computation  for  both  of  the  above  smoothers 
is  O(n)  operations.  Of  course  we  can  think  of  fitting  local  polynomials  as  well,  but  in 
practice  the  gain  in  bias  is  email  relative  to  the  extra  computational  burden. 

5.1. 1.3  Locally  weighted  linear  smoothers. 

Cleveland  (1979)  suggested  using  the  local  linear  smoother,  but  also  suggested  weighting 
the  points  in  the  neighborhood  according  to  their  distance  in  z  from  z,-.  This  produces  even 
■noother  curves  at  the  expense  of  an  increased  computation  time  of  O(kn)  operations.  (In 
the  local  linear  smoother,  we  can  obtain  the  fitted  value  at  x<+t  from  that  at  z*  by  applying 
some  simple  updating  algorithm  to  the  latter.  If  local  weighting  is  performed,  we  can  no 
longer  use  updating  formulae.) 

5.1. 1.4  Kernel  smoother*. 

The  kernel  smoother  (Gaaaer  and  Muller,  1979)  applies  a  weight  function  to  every  observa¬ 
tion  in  calculating  the  fit  at  x,.  A  variety  of  weight  functions  or  kernels  exist  and  a  popular 
choice  is  the  gauastan  kernel  centered  at  z\.  They  produce  the  smoothest  functions  and  are 
computationally  the  most  expensive.  The  coet  is  0(n*)  operations,  although  in  practice 
the  kernels  have  a  bounded  domain  and  this  brings  the  coet  down  to  O(sn)  for  some  «  that 
depends  on  the  kernel  and  the  data. 

In  all  but  the  kernel  smoother,  the  span  controls  the  smoothness  of  the  estimated 
function.  The  larger  the  span,  the  smoother  the  function.  In  the  esse  of  the  kernel  smoother, 
there  is  a  seals  parameter  that  controls  the  spresd  of  the  kernel,  and  the  larger  the  spread, 
the  smoother  the  function.  We  will  discuss  the  choice  of  spans  in  section  5.4. 
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For  our  particular  application,  it  waa  found  that  the  locally  weighted  linear  smoother 
and  the  kernel  smoother  produced  the  meet  satisfactory  results.  However,  when  the  sample 
sue  gets  large,  these  smoothers  become  too  expensive,  and  we  have  to  sacrifice  smoothness 
for  computational  speed.  In  this  case  we  would  use  the  faster  local  linear  smoother. 

6.1.2.  Two  dimensional  smoothers. 

There  are  substantial  differences  between  one  and  two  dimensional  smoothers.  When  we 
find  neighbors  in  two  space,  ws  immediately  force  some  metric  on  the  space  in  the  way  we 
define  distance.  In  our  algorithm  we  simply  use  the  euclidean  distance  and  assume  the  two 
variables  are  in  the  same  scale. 

It  is  also  computationally  harder  to  find  neighbors  in  two  dimensions  than  in  one.  The 
k-d  tree  (  Friedman,  Bently  and  Finkel,  1976)  is  an  efficient  algorithm  and  data  structure  for 
finding  neighbors  in  k  dimensions.  The  name  arises  from  the  data  structure  used  to  speed 
up  the  search  time  —  a  binary  tree.  The  technique  can  be  thought  of  as  a  multivariable 
version  of  the  binary  search  routine.  Friedman  et  al  show  that  the  computation  required 
to  build  the  tree  is  0(kn  log  n)  and  the  expected  search  time  for  the  m  nearest  neighbors 
of  any  point  is  0(log  n). 

5.1.3.  The  local  planar  surface  smoother. 

We  wish  to  find  Smooth  (y  |  *o)  where  sq  is  a  2- vector  not  necessarily  present  in  the  sample. 
The  following  algorithm  is  analogous  to  the  local  linear  smoother: 

•  Build  the  2-d  tree  for  the  n  pairs  *»«)• 

•  Find  the  ns  nearest  neighbors  of  so,  and  fit  the  least  squares  plane  through  their 
associated  y  values. 

•  The  smooth  at  so  is  defined  to  be  the  fitted  value  at  so. 

This  algorithm  does  not  allow  updating  as  in  the  one-dimensional  local  linear  smoother. 
The  computation  time  for  one  fitted  value  is  0(log  n  +  ns).  For  this  reason,  we  can  include 
weights  at  no  extra  order  in  computation  cost.  We  use  gauasian  weights  with  covariance 
k*I  and  centered  at  so,  and  h  is  another  parameter  of  the  procedure. 

A  simpler  version  of  this  smoother  uses  the  (gauasian  weighted)  average  of  the  y  values 
tor  the  ns  neighbors.  In  the  one  dimensional  case,  we  find  that  fitting  local  straight  lines 
reduces  the  bias  at  the.  boundaries.  In  surface  smoothing,  the  proportion  of  points  on  the 


Chapter  5:  Algorithmic  details  65 


boundary  increases  dramatically  as  we  go  from  one  to  two  dimensions.  This  provides  a 
strong  motivation  for  fitting  planes  instead  of  simple  averages. 

5.2.  The  projection  step. 

Tbs  other  step  in  the  principal  curve  and  surface  procedures  is  to  project  each  point  onto 
the  current  surface  at  curve.  In  our  notation  we  require  (*,)  for  each  i.  We  have  already 
described  the  exact  approach  in  chapter  3  for  principal  curves,  which  we  repeat  here  for 
completeness. 

5.2.1.  Projecting  by  exact  enumeration. 

We  project  into  the  line  segment  joining  every  adjacent  pair  of  fitted  values  of  the  curve, 
and  find  the  closest  such  projection.  Into  implies  that  when  projecting  we  do  not  go  beyond 
tbs  two  points  in  question.  This  procedure  is  exact  but  computationally  expensive  (O(n) 
operations  per  search.)  Nonetheless,  we  have  used  this  method  on  the  smaller  data  sets 
(<  150  observations.)  There  is  no  analogue  for  the  principal  surface  routine. 

5.2.2.  Projections  using  the  k-d  tree. 

At  each  of  the  n  values  of  i  we  have  a  fitted  p  vector.  This  is  true  for  either  the  principal 
curve  or  surface  procedure.  We  can  build  a  p-d  tree,  and  for  each  x*,  find  its  nearest 
neighbor  amongst  these  fitted  values.  We  then  proceed  differently  for  curves  and  surfaces. 

e  For  curves  we  project  tbs  point  tale  the  segments  joining  this  nearest  point  and  its 
left  neighbor.  We  do  tbs  same  for  the  right  neighbor  and  pick  the  closest  projection. 

e  For  surfaces  wo  find  the  nearest  fitted  value  as  above.  Suppose  this  is  at 

We  then  project  *i  onto  the  plane  corresponding  to  this  fitted  value  and  get  a  new 
value  A*.  (This  plans  has  already  been  calculated  in  the  smoothing  step  and  is  stored.) 
We  then  evaluate  ^rt(A*)  and  check  if  it  is  indeed  closer.  (This  precautionary  step 
is  similar  to  projecting  into  the  line  segments  in  the  case  of  curves.)  If  it  is,  we 
set  a  A*,  else  we  set  One  could  think  of  iterating  this  procedure, 

which  is  similar  to  a  gradient  search.  Alternatively  one  could  perform  a  Newton- 
Raphson  search  using  derivative  information  contained  in  the  least  squares  planes. 
These  approaches  are  expensive,  and  in  the  many  examples  tested,  made  little  or  no 
difference  to  the  estimate. 
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5.2.3.  Rescaling  the  A’s  to  arc-length. 

In  the  principal  curve  procedure,  a a  a  matter  of  practice,  we  always  rescale  the  A’s  to  arc- 
length.  The  estimated  A's  are  then  measured  in  the  same  units  as  the  observations.  Let  A' 
denotes  the  rescaled  A^’s,  and  suppose  a|^  are  sorted.  We  define  A?  recursively  as  follows: 

•  X{  =  0. 

•  = k-i + jW?1)  -  /0)(aK,1)|- 


Unscaled  A 


Figure  (5.1)  A  A  plot  for  the  circle  example.  Along  the  vertical  axio 
we  plot  the  final  vaieee  for  X* ,  after  rescaling  the  Vs  at  every  iteration 
in  the  principal  carve  procedure.  Along  the  horisoatal  axis  we  have 
the  final  Vs  using  the  principal  curve  procedure  with  no  rescaling. 


In  general  there  is  no  analogue  of  rescaling  to  arc-length  for  surfaces.  Surface  arsais  the 
corresponding  quantity.  We  can  adjust  the  parameters  locally  so  that  the  area  of  a  small 
region  in  parameter  space  has  the  same  area  as  the  region  it  defines  on  the  surface.  But 
this  adjustment  will  be  different  in  other  regions  of  the  surface  having  the  same  values  for 
one  of  the  parameters.  The  exceptions  are  surfaces  with  sero  gaussian  curvature.  (These 
are  surfaces  that  can  be  obtained  by  emoothlg  denting  a  hyperplane  to  form  something  like 
a  corrugated  sheet.  One  can  imagine  that  such  a  rescaling  is  then  passible). 
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Figure  (5.2)  Each  iteration  approximately  preserve*  the  metric 
from  the  previous  one.  The  starting  carve  is  unit  speed,  and  so  the 
final  carve  is  approximately  so,  up  to  a  constant. 

Even  though  it  is  not  possible  to  do  such  a  rescaling  for  surfaces,  it  would  be  comforting 
to  know  that  our  par  ante  trisation  remains  reasonably  consistent  over  the  surface  as  we  go 
through  the  iterations. 

Figure  5.1  demonstrates  what  happens  if  we  use  the  principal' curve  procedure  on  the 
circle  example,  and  do  not  rescale  the  parameter  estimates  at  each  iteration.  The  metric 
gets  preserved,  up  to  a  scalar.  Figure  5.2shows  why  this  is  so.  The  original  metric  gets 
transferred  from  one  iteration  to  the  next.  As  long  as  the  curves  do  not  change  dramatically 
from  one  iteration  to  the  next,  there  will  not  be  much  distortion. 

5.3.  Span  selection. 

We  consider  there  to  be  two  categories  of  spans  corresponding  to  two  distinct  stages  in  the 
algorithm. 

5.3.1.  Global  procedural  spans. 

The  first  guess  for  /  is  a  straight  line.  In  many  of  the  interesting  situations,  the  final 
curve  will  not  be  a  function  of  the  arc  length  of  this  initial  curve.  The  final  curve  is 
reached  by  successively  handing  the  original  curve.  We  have  found  that  if  the  initial  spans 
of  the  smoother  are  too  small,  the  curve  will  bend  too  fast,  and  may  get  lost!  The  moet 
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successful  strategy  has  been  to  initially  use  large  spans,  and  then  to  decrease  them  slowly 
In  particular,  we  start  with  a  span  of  O.Sn,  and  let  the  procedure  converge.  We  then  drop 
the  span  to  0.4n  and  converge  again.  Finally  the  same  is  done  at  0.3n  by  which  time  the 
procedure  has  found  the  general  shape  of  the  curve.  We  then  switch  to  mean  square  error 
(MSE)  span  selection  mode. 

5.3.2.  Mean  squared  error  spans. 

The  procedure  has  converged  to  a  self  consistent  curve  for  the  spaa  last  used.  If  we  reduce 
the  span,  the  average  distance  will  decrease.  This  situation  arises  in  regression  as  well.  In 
regression,  however,  there  is  a  remedy.  We  can  use  cross-validation  (Stone  1977)  to  select 
the  span.  We  briefly  outline  the  idea. 

5.3.2.1  Cross-validation  in  regression. 


Suppose  we  have  a  sample  of  n  independent  pairs  (yi,Xi)  from  the  model  Y  —  f(X)  + 1. 
A  nonpar ametric  estimate  of  /(x o)  is  /,(x o)  =  Smooth ,  (y  |*o).  The  expected  squared 
prediction  error  is 

EPE=  E(y -/(*))*  (5.2) 


where  the  expectation  is  taken  over  everything  random  (i.e.  the  sample  used  to  estimate 
/(•)  and  the  future  pairs  (X,Y)).  We  use  the  residual  sum  of  squares. 


*SS(e)  =  £>-/,(*,))*, 

as  the  natural  estimate  of  EPE.  This  is  however,  a  biassed  estimate,  as  can  be  i  leen  by 
letting  the  span  s  shrink  down  to  0.  The  smpoth  then  estimates  y,-  by  itself,  and  RSS  is  < 
zero.  We  call  this  bias  due  to  overfiiting  since  the  bias  is  due  to  the  influence  y<  has  in 
forming  ite'  own  prediction.  This  also  shows  us  that  we  cannot  use  RSS  to  help  us  |  ick  the 
span.  We  can,  however,  use  the  cross-validated  residual  sum  of  squares  (CVRSS).  This  is 
defined  as 

CVRSS(o)  =  £^(yj  -  Smoothly  |xj))a. 


_  (5-3) 

where  Smooth  I’*(y  |  Xi)  is  the  smooth  calculated  from  the  data  with  the  pair  (y,-  x,-)  re¬ 
moved,  and  then  evaluated  at  Xf.  It  can  be  shown  that  this  estimate  is  approximately 
unbiassed  for  the  true  prediction  error.  In  minimiziug  the  prediction  error,  we  als  j  mini- 
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mile  the  integrated  mean  square  error  EMSE  given  by 

EMSE(s)  =  E(/,pO  -  /(X))* 

since  they  differ  by  a  constant.  We  can  decompose  this  expression  into  a  sum  of  a  variance 
and  bias  terms,  namely 

EMSE(s)  =  E[  Var(/,(J0]  +  *[(  *{f.(X)  \X)  -  /(^))s] 

=  VAR(s)  +  BIASi(s). 

As  s  gets  smaller  the  variance  gets  larger  (averaging  over  less  points)  but  the  bias  gets 
smaller  (width  of  the  neighborhoods  gets  smaller),  and  vice  versa.  Thus  if  we  pick  s  to 
minimise  CVRSS(s)  we  are  trying  to  minimise  the  true  prediction  error  or  equivalently  to 
find  the  span  which  optimally  mixes  bias  and  variance. 

Getting  back  to  the  curves,  one  thought  is  to  cross- valid  ate  the  orthogonal  distance 
function.  This,  however,  will  not  work  because  we  would  still  tend  to  use  span  zero.  (In 
general  we  have  more  chance  of  being  close  to  the  interpolating  curve  than  any  other  curve). 
Instead,  we  cross- valid  ate  the  co-ordinates  separately. 

5.3.2.S  Cross-validation  for  principal  curves. 

Suppose  /  is  a  principal  curve  of  h,  for  which  we  have  an  estimate  /  based  on  a  sample 
^1»  •  •  | 

A  natural  requirement  is  to  choose  *  to  minimize  EMSE{s)  given  by 
EMSE(s)  =  E*  ||/(A/(X))  -  /.(A/(X))f 

=  E  *»a(  Var (/.(*,(*))  |  A,(x))  +  Ehx  |/(A/(JT))  -  /*(Aj(X))|*  (5.4) 

which  is  once  again  a  trade-off  between  bias  and  variance.  Notice  that  were  we  to  look  at  the 
closest  distance  between  these  curves,  then  the  interpolating  curve  would  be  favored.  As  in 
the  regression  case,  the  quantity  EPE{s)  -  Ek  ||x  -  /,(Aj(JT))||  estimates  EMSE(s)  + 

0(/),  where  D(f)  =  E  ||x  -  /{Aj(X))||  .  It  is  thus  equivalent  to  choose  s  to  minimize 
EMSE{s)  or  EPE(s).  As  in  the  regression  case,  the  cross-validated  estimate 

CV*SS(.)  =  £  Smooth^*,- |A,))»],  (5.5) 

i*i  Li*i  J 
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where  A,-  =  A attempts  to  do  this.  Since  we  do  not  know  A,-,  we  pick  At-  =  A^((k) (*,) 
where  /(*)  is  the  (non  croes-validated)  estimate  of  /.  In  practice,  we  evaluate  CVRSS(s) 
for  a  few  values  of  s  and  pick  the  one  that  gives  the  minimum. 

From  the  computing  angle,  if  the  smoother  is  linear  one  can  easily  find  the  cross- 
validated  fits.  In  this  case  y  —  Cy  for  some  smoother  matrix  C,  and  the  cross-validated  fit 
V(i)  «  giv«n  by  y(i)  =  £,•*•  (Wahba  1975). 

There  are  a  number  of  issues  connected  with  the  algorithms  that  have  not  yet  been 
mentioned,  such  as  a  robustness  and  outlier  detection,  what  to  display  and  how  to  do  it, 
and  bootstrap  techniques.  The  next  chapter  consists  of  many  examples,  and  we  will  deal 
with  these  issues  as  they  arise. 
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Examples 

This  chapter  contains  six  examples  that  demonstrate  the  procedures  on  real  and  simulated 
data.  We  also  introduce  some  ideas  such  as  bootstrapping,  robustness,  and  outlier  detection. 

Example  6.1.  Gold  assay  pairs. 

This  real  data  example  illustrates: 

•  A  principal  curve  in  2-space, 

•  non-linear  errors  in  variables  regression, 

•  co-ordinate  function  plots,  and 

•  bootstrapping  principal  curves. 

A  California  baaed  company  collects  computer  chip  waste  in  order  to  sell  it  for  its  content 
of  gold  and  other  precious  metals.  Before  bidding  for  a  particular  cargo,  the  company  takes 
a  sample  in  order  to  estimate  the  gold  content  of  the  the  whole  lot.  The  sample  is  split  in 
two.  One  sub-sample  is  assayed  by  an  outside  laboratory,  the  other  by  their  own  inhouse 
laboratory.  (The  names  of  the  company  and  laboratory  are  withheld  by  request).  The 
company  wishes  to  eventually  use  only  one  of  the  assays.  It  is  in  their  interest  to  know 
which  laboratory  produces  on  average  lower  gold  content  assays  for  a  given  sample. 

The  data  in  figure  6.1a  consists  of  250  pairs  of  gold  assays.  Each  point  is  represented 

by 

where  =  log(l  +  assay  yield  for  ith  assay  pair  for  lab  j)  and  where  j  —  1  corresponds 
to  the  inhouse  lab  and  j  =  2  the  outside  lab.  The  log  transformation  tends  to  stabilize  the 
variance  and  produce  a  more  even  scatter  of  points  than  in  the  untransformed  data.  (There 
were  many  more  small  assays  (1  os  per  ton)  than  larger  ones  (>  10  os  per  ton)). 
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Figure  6.1a  Plot  of  the  log  assays  for  the 
inhouse  and  outside  labs.  The  solid  curve  is  the 
principal  curve,  the  dashed  curve  the  scatter- 
plot  smooth. 


Figure  6.1b  Estimated  coordinate  func¬ 
tions.  The  dashed  curve  is  the  outside  lab,  the 
solid  curve  the  inhouse  lab. 


A  standard  analysis  might  be  a  paired  t-test  for  an  overall  difference  in  assays.  This 
would  not  reflect  local  differences  which  can  be  of  great  importance  since  the  higher  the 
level  of  gold  the  more,  important  the  difference. 

The  data  was  actually  analyzed  by  smoothing  the  differences  in  log  assays  against  the 
average  of  the  two  assays.  This  can  be  considered  a  form  of  symmetric  smoothing  and  was 
suggested  by  Cleveland  (1983).  We  discuss  the  method  further  in  chapter  7. 

The  model  presented  here  for  the  above  data  is 


f*i.  \  _  //i(n)\  +  f  ei.  \ 

V*W/  \  fi(Ti)  J  [e*J 


(6.1) 


where  r,-  is  the  unknown  true. gold  content  for  sample  <  (or  any  monotone  function  thereof), 
fj(r{)  is  the  expected  assay  result  for  lab  j,  and  ey,-  is  measurement  error.  We  wish  to 
analyze  the  relationship  between  f\  and  /s  for  different  true  gold  contents. 

This  is  a  generalization  of  the  errors  in  variables  model  or  the  structural  model  (if  we 
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regard  the  u  themselves  as  unobservable  random  variables),  or  the  functional  model  (if  the 
T{  are  considered  fixed).  This  model  is  traditionally  expressed  as  a  linear  model: 

ft)  ■("")*(:) 

where  /j(r,)  =  z,  and 

/i(r,)  =  fx  o  /,_1(z,)  (assuming  /j  is  monotone) 

==  a  +  ft* 

It  suffers,  however,  from  the  same  drawback  as  the  t-test  in  that  only  global  inference  is 
possible. 

We  assume  that  the  ey,-  are  pairwise  independent  and  that  * 

Var(eu)  =  Var(ej.)  V  t. 

The  model  is  estimated  Using  the  principal  curve  estimate  for  the  data  and  is  repre¬ 
sented  by  the  solid  curve  in  figure  6.1a.  The  dashed  curve  is  the  usual  scatterplot  smooth 
of  X]  against  xj  and  is  clearly  misleading  as  a  scatterplot  summary.  The  curve  lies  above 
the  45°  line  in  the  interval  1.4  1 4  which  represents  an  untransformed  assay  interval  of  3  to 
15  oz/ton.  In  this  interval  the  inhouse  average  assay  is  lower  than  that  of  the  outside  lab. 
The  difference  is  reversed  at  lower  levels,  but  this  is  of  less  practical  importance  since  at 
these  levels  the  cargo  is  less  valuable.  This  is  more  clearly  seen  by  examining  the  estimated 
coordinate  function  plots  in  figure  6.1b. 

A  natural  question  arising  at  this  point  is  wether  the  kink  in  the  curve  is  real  or  not. 
If  we  had  access  to  more  data  from  the  same  population  we  could  simply  calculate  the 
principal  curves  for  each  and  see  how  often  thu  kink  is  reproduced.  We  could  then  perhaps 
construct  a  95%  confidence  tube  for  the  true  curve. 

In  the  absence  of  such  repeated  samples,  we  use  the  bootstrap  (Efron  1981,  1982)  to 
simulate  them.  We  would  like  to,  but  Cannot,  generate  samples  of  size  n  from  F,  the  true 
distribution  of  x.  Instead  we  generate  samples  of  size  n  from  F,  the  empirical  or  estimated 
distribution  function,  which  puts  mass  1/n  cn  each  of  the  sample  points  x,-.  Each  such 
sample,  which  samples  the  points  z,-  with  replacement,  is  called  a  bootstrap  sample. 

*  In  the  linear  model  one  usually  requires  that  V ar(eJt)  =»  constant,-.  This  assumption  can  be 

relaxed  here. 
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Figure  6.1e  25  bootstrap  curves.  The  data  X  is  sampled  25  times 
with  replacement,  each  time  yielding  a  bootstrap  sample  AT*.  Each 
curve  is  the  principal  curve  of  such  a  sample. 

Figure  6.1c  shows  the  principal  curves  obtained  for  25  such  bootstrap  samples.  The 
45°  line  is  included  in  the  figure,  and  we  see  that  none  of  the  curves  cross  the  line  in  the 
region  of  interest.  This  provides  strong  evidence  that  the  kink  is  indeed  real. 

When  we  compute  a  particular  bootstrap  curve,  we  use  the  principal  curve  of  the 
original  sample  as  a  starting  value.  Usually  one  or  two  iterations  are  all  that  is  required 
for  the  procedure  to  converge.  Also,  since  each  of  the  bootstrap  points  occurs  at  one  of  the 
sample  sites,  we  know  where  they  project  onto  this  initial  curve. 

It  is  tempting  to  extract  from  the  procedure  estimates  of  r,-,  the  true  gold  level  for 
sample  t.  However,  ?,  need  not  be  the  true  gold  level  at  all.  It  may  be  any  variable  that 
orders  the  pairs  /(?,■)  along  the  curve,  and  is  probably  some  monotone  function  of  the  true 
gold  level.  It  is  clear  that  both  labs  could  consistently  produce  biased  estimates  of  the  true 
gold  level  and  there  is  thus  no  information  at  all  in  the  data  about  the  true  level. 

Estimates  of  r,-  do  provide  us  with  a  good  summary  variable  for  each  of  the  pairs,  if 
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that  La  required: 


U  =  A(*i) 

since  we  obtain  ?,•  by  projecting  the  point  z,  onto  the  curve.  Finally  we  observe  that  the 
above  analysis  could  be  extended  in  a  straightforward  way  to  include  3  or  more  laboratories. 
It  is  hard  to  imagine  how  to  tackle  the  problem  using  standard  regression  techniques. 

Example  6.2.  The  helix  in  three-space. 

This  is  a  simulated  example  illustrating: 

•  A  principal  curve  in  3-space , 


•  co-ordinate  plots,  and 

•  cross-validation  and  span  selection. 

We  looked  at  the  bias  of  the  principal  curve  procedure  in  estimating  the  helix  in  chapter  4. 
We  now  demonstrate  the  procedure  by  generating  data  from  that  model.  We  have 

/*n(4xA)\ 

/(A)  =  I  eos(4*A)  i  Hbe, 

where  A  —  £/[0, 1]  and  *  ~  M(0,.ZI).  This  situation  does  not  present  the  principal  curve 
procedure  with  any  real  problems.  The  reason  is  that  the  starting  vector  passes  down  the 
middle  of  the  helix  and  the  data  projects  onto  it  in  nearly  the  correct  order.  Table  6.1shows 
the  steps  in  the  iterations  as  the  procedure  converges  at  each  of  the  procedural  spans  shown. 
At  a  span  of  «  =  .2  we  use  cross-validation  to  find  the  minimum  mse  span. 

Figure  6.2c  shows  the  CV RS S  curve  used  to  select  the  span,  which  is  0.1  with  a 
value  of  CV RS$  of  0.1944.  One  more  step  is  performed  and  the  procedure  is  terminated. 
Figure  6. 2d  shows  the  estimated  co-ordinate  functions  for  this  choice  of  span.  We  see 
that  the  estimate  of  the  linear  co-ordinate  is  rather  wiggly.  It  is  clear  that  a  small  span 
was  required  to  estimate  the  sinusoidal  co-ordinates,  but  a  large  span  would  suffice  for 
the  linear  co-ordinate.  This  suggests  a  different  scheme  for  cross-validation — choosing  the 
spans  separately  for  each  co-ordinate.  The  results  are  shown  in  figures  d.2e  and  6.2f.  As 
predicted,  a  larger  span  is  chosen  for  the  linear  coordinate,  and  its  estimate  is  no  longer 
wiggly.  This  is  the  final  model  referred  to  in  tue  table  and  represented  in  figure  6.2. 
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Figure  6.3a  Data  fowtud  from  a  he*  Figure  6.2b  A  act  bar  view  of  the  helix,  the 
lix  with  independent  error*  oa  each  coordinate.  data  asd  the  principal  carve. 

The  dashed  carve  is  the  original  helix,  the  solid 
carve  is  the  principal  carve  estimate. 


Table  6.1.  The  steps  in  the  iterations.  Initially  the  procedure 
goes  through  a  regimen  of  procedural  spans.  Then  the  final  spaa  is 
found  by  cross  validation. 


CVRSSj(a)  CVRSS(a) 


Chapter  6:  Examples  77 


aia  *- 

0.06 


Span  s 

Figure  6.3c  The  cross  validation  cim 
shows  CVRSS(s)  as  a  function  at  the  span  a. 
One  apaa  ia  need  for  all  3  co-ordinates. 


Estimated  X 

Figure  6.3d  The  estimated  co-ordinate 
faactioaa  for  tka  helix,  oaing  the  apaa  found  in 
figere  6.2c. 


0.1  0.8 

Span  s 

Figure  6.3e  The  cross-validation  carve 
shows  CV RSSj(s)  aa  a  fnactioa  at  the  apaa  a. 
A  aeparata  apaa  ia  foand  for  each  co-ordinate. 


Estimated  X 

Figure  6.2f  The  estimated  co-ordinate  func¬ 
tions  for  the  helix,  uaing  the  spans  found  in  fig¬ 
ure  6.2f. 
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The  entry  labelled  d.oX  in  table  6.  lie  an  abbreviation  for  degrees  of  freedom.  In 
linear  regression  the  number  of  parameters  used  in  the  fit  is  given  by  tr(fT)  where  H  is  the 
projection  or  hot  matrix.  If  the  response  variables  y,  are  iid  with  variance  or1,  then 

53  V#r(&)  =  51  VarMv) 

»=l  «=i 

=  <r*tr  (ff) 

We  can  do  the  same  calculation  for  a  linear  smoother  matrix  C,  and  in  fact  for  the  local 
straight  lines  smoother  we  even  have  tr  (C*C)  =  tr(C).  As  the  span  decreases,  the  diagonal 
entries  of  C  get  larger,  and  thus  the  variance  of  the  estimates  increases,  as  we  would  expect. 
One  can  also  approach  this  from  the  other  side  by  looking  at  the  residual  sum  of  squares. 
In  the  absence  of  bias  we  h/.ve 

ZRSS  =  ■  11(7-0)111* 

=  *»»(/-  C)\I-C)g 
~  tr((/  —  C)\I  -  C)  Cov(y)l 
sr  (n  -  tr  (C))<r* 

if  tr(C'C)  =  tr(C).  *  More  motivation  for  regarding  tr  (C)  as  the  number  of  parameters  or 
d.oX  can  be  found  in  Cleveland  (1979)  and  Tibshirani  (1984).  Some  calculations  similar  to 
those  in  3.5.1  show  that  the  expected  squared  distance  of  X  from  the  true  /  is  D2  fa  2a1,  or 
more  precisely  O3  m  2a1— a4 /(ip1)  where  p  is  the  radius  of  curvature,  which  in  our  example 
is  1  +  l/*1.  Thus  D1  m  0.18.  The  cross  validated  residual  estimate  £CV  RSSj  was  found 
to  be  0.189. '  The  orthogonal  distance  from  the  final  curve  is  Z?2(n)  =  0.162.  This  is  deflated 
due  to  overfitting.  The  average  value  of  d.o.f  for  the  final  curve  is  (one  for  each  co-ordinate) 
9.7,  or  a  total  of  29.1.  Some  simple  heuristics  show  that  the  we  should  scale  this  value  up  by 
by  2n/(2n  -  d.o.f )  =  300/(300  -  29.1)  =  1.11.  We  then  get  2n/(2n  -  d.o./)D*<n>  =  0.179 
which  is  back  in  the  correct  ballpark. 

It  is  more  convenient  to  view  the  3  dimensional  examples  on  a  color  graphics  system 
(such  as  the  Chromatics  system  of  the  Orion  group,  Stanford  University).  This  allows  one 
to  rotate  the  points  in  real  time  and  thus  see  the  3rd  dimension. 

*  For  oar  smoothers,  each  row  of  C  is  the  row  of  a  projection  matrix,  and  hence  e'e,-  =*  e«. 
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Example  6.3.  Geological  data. 

This  real  data  example  illustrates: 

•  Data  modelling  in  3  dimensions, 

•  non-linear  factor  analysis,  and 

•  outlier  detection  and  robust  fitting. 

The  data  in  this  example  consists  of  measurements  of  the  mineral  content  of  64  core  samples, 
each  taken  at  different  depths  (Chernoff,  1973).  Measurements  were  made  of  10  minerals 
in  each  sample.  We  simply  label  the  minerals  Xt,-“  ,Xvo,  and  analyze  the  first  three. 


x- 

e 

u 

e 

a 
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Mineral  X, 

Figure  6.3a  The  principal  curve  for  the  mineral  data.  (Variable 
X3  is  into  the  page).  The  spikes  join  the  points  to  their  projection 
on  the  carve.  The  4  outliers  are  joined  to  the  curve  with  the  broken 
lines. 

Figure  6.3a  shows  the  data  and  the  solution  curve.  (A  final  span  of  0.35  was  manually 
selected.)  In  3-D  the  picture  looks  like  a  dragon  with  its  tad  pointing  to  the  left  and  the 
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Depth  order  of  core 


Figure  6.3b  The  value*  are  plotted  againet  the  depth 

order  of  the  core  samples. 

long  (outlier)  spikes  could  be  a  mane.  The  linear  principal  component  explains  55%  of  the 
variance,  whereas  this  solution  explains  82%. 

The  spikes  join  the  observations  to  their  closest  projections  on  the  curve.  This  is  a 
useful  device  for  spotting  outliers.  A  robust  version  of  the  principal  curve  procedure  was 
used  in  this  example.  After  the  first  iteration,  points  receive  a  weight  which  is  inversly 
proportional  to  their  distance  from  the  curve.  In  the  smoothing  step,  a  weighted  smooth 
is  used,  and  if  the  weight  is  below  a  certain  threshhold,  it  is  set  to  0.  Four  points  were 
identified  as  outliers,  and  are  labelled  differently  in  figure  6.3a  .  We  would  really  consider 
them  model  outliers,  since  in  that  region  of  the  curve  the  model  does  not  appear  to  fit  very 
well. 

Figure  6.35  shows  the  relationship  between  the  order  of  the  points  on  the  curve,  and 
the  depth  order  of  the  core  samples.  The  curve  appears  to  recover  this  variable  for  the  most 
part.  The  area  where  it  does  not  recover  the  order  is  where  the  curve  appears  to  fit  the 
data  badly  anyway.  So  here  we  have  uncovered  a  hidden  variable  or  factor  that  we  are  able 
to  validate  with  the  additional  information  we  have  about  the  ordering.  The  co-ordinate 
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0  2  4  s  B 

Estimated  X 


Figure  6.3c  Tfc«  Mtimaud  co-ordin*t«  function*  or  factor  loading 

carves  for  the  three  miaer*]*. 

plot,  would  then  represent  the  mean  level  of  the  particular  mineral  at  different  depths  (see 
figure  6.3c  ).  Usually  one  would  have  to  use  these  co-ordinate  plots  to  identify  the  factors, 
just  as  one  uses  the  factor  loadings  in  the  linear  case. 

Example  6.4.  The  uniform  ball. 

This  example  illustrates: 

•  A  principal  surface  in  3  space,  and 

•  a  connection  to  multidimensional  scaling. 

The  data  is  artificially  constructed,  with  no  noise,  by  generating  points  uniformly  from  the 
surface  of  a  sphere.  It  is  the  same  data  used  by  Shepard  and  Carroll  (1966)  to  demonstrate 
their  parametric  mapping  algorithm,  (see  reference  and  chapter  7).  We  simply  use  it  here 
to  demonstrate  the  ability  of  the  principal  surface  algorithm  to  produce  surfaces  that  are 
not  a  function  of  the  starting  plane  (in  analogy  to  the  circle  example  in  chapter  3). 

There  are  61  data  points,  as  shown  in  figure  6.4a.  One  point  is  placed  at  each 
intersection  of  5  equally  spaced  parallels  and  12  equally  spaced  meridians.  The  extra  point 
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Figure  0.4a  The  daU  point*  are  placed 
in  a  uniform  pattern  on  the  surface  of  a  sphere. 
The  south  pole  is  missing. 


Figure  6.4b  The  second  iteration  of  the 
principal  surface  procedure  finds  a  surface  that 
is  a  function  of  the  first  iteration. 


Figure  6.4c  An  intermediate  stage  in  the 
iterations. 


Figure  6.4d  The  final  surface  produced  by 
the  principal  surface  routine. 
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Estimated  Xt 


Figure  6.4a  Another  view  of  the  final  prin*  Figure  6.4f  The  A  map  ia  a  two  dimensional 
cipal  surface.  summary  of  the  data.  It  resembles  a  stereo¬ 

graphic  map  of  the  world. 

is  placed  at  the  north  pole.  (If  we  placed  a  point  at  the  south  pole  the  principal  surface 
procedure  would  never  move  from  the  starting  plane,  which  is  in  fact  a  principal  surface.) 
Figures  6.4b  to  6.4d  show  various  stages  in  the  iterative  procedure,  and  figure  6.4e  shows 
another  view  of  the  final  surface.  Figure  6.4f  is  a  parameter  map  of  the  two  dimensional  A. 
It  resembles  a  stereographic  map  of  the  earth.  (A  stereographic  map  is  obtained  by  placing 
the  earth,  or  a  model  thereof,  on  a  piece  of  paper.  Each  point  on  the  surface  is  mapped 
onto  the  paper  by  extrapolating  the  line  segment  joining  the  north  pole  to  the  point  until 
it  reaches  the  paper.)  Points  in  the  southern  hemisphere  are  mapped  on  the  inside  of  a 
circle,  points  in  the  northern  hemisphere  on  the  outside,  and  there  is  a  discontinuity  at  the 
north  pole.  Points  close  together  on  this  map  are  close  together  in  the  original  space,  but 
the  converse  is  not  necessarily  true.  This  map  provides  a  two  dimensional  summary  of  the 
original  data.  If  we  are  presented  with  any  new  observations,  wr:  can  easily  locate  them  on 
the  map  by  finding  their  closest  position  on  the  surface. 


Example  6.5.  One  dimensional  color  data. 

This  almost  real  data  example  illustrates: 


Second  principal  azii 
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Figure  6.5a  The  4  dimensional  color  date  Figure  6.5b  The  estimated  co-ordinate 
projected  onto  the  first  principal  component  planer  functions  plotted  against  the  arc  length  of  the 
The  principal  curve,  found  in  the  original  four-  principal  curve.  This  X  will  be  monotone  with 
space,  is  also  projected  onto  this  plane.  the  true  waveleugt'u. 

•  A  principal  curves  in  4-space,  and 

e  a  one  dimensional  MDS  example. 

These  data  were  t'sed  by  Shepard  and  Carroll  (1966)  (who  cite  the  original  source  as  Boynton 
and  Gordon  (1965))  to  illustrate  a  veroion  of  their  parametric  data  representation  techniques 
called  proximity  analysis.  We  give  more  details  of  this  technique  in  chapter  7. 

Each  of  the  23  observations  represents  a  spectral  color  at  a  specific  wavelength.  Each 
observation  has  4  psychological  variables  associated  with  it.  They  are  the  relative  frequen¬ 
cies  with  which  100  observers  named  the  color  blue,  green,  yellow  and  red.  As  can  be  seen  in 
figure  6.5a*  there  is  very  little  error  in  this  data,  and  it  is  one  dimensional  by  construction. 
Since  the  color  changes  slowly  with  wavelength,  so  should  these  relative  frequencies,  and 
they  should  thus  fall  on  a  one  dimensional  curve,  as  they  do.  The  data,  by  construction  lies 
in  a  3  dimensional  simplex  since  the  four  variables  add  up  to  1.  The  pictures  we  show  are 
projections  of  this  simplex  onto  the  2-D  subspace  spanned  by  the  first  two  linear  principal 
components.  Figure  6.5a  shows  the  solution  curve  and  figure  6.5b  shows  the  recovered 
parameters  and  co-ordinate  functions.  This  solution  is  in  qualitative  agreement  with  the 
data  and  with  the  solution  produced  by  Shepard  and  Carroll. 
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Example  6.6.  Lipoprotein  data. 

This  real  data  example  illustrates: 

•  A  principal  surface  in  3  space  with  some  interpretations, 

•  a  principal  curve  suggested  by  the  surface,  and 

•  co-ordinate  plots  for  surfaces. 

Williams  and  Krause  (1982)  conducted  a  study  to  investigate  the  inter-relationships  between 
the  serum  concentrations  of  lipoproteins  at  varying  densities  in  sedentry  men.  We  focus 
on  a  subset  of  the  data,  and  consider  the  serum  concentrations  of  LDL  3-4  (Low  Density 
Lipoprotein  with  flotation  rates  between  5/3  —  4),  LDL  7-8,  and  HDL  3  (High  Density 
Lipoprotein)  in  the  sample  of  81  men.  Figures  6.6a-d  are  different  views  of  the  principal 
surface  found  for  the  data.  Quantitively  this  surface  explains  97.4%  of  the  variability  in  the 
data,  and  accounts  for  80%  of  the  residual  variance  unexplained  by  the  principal  component . 
plane.  Qualitatively,  we  see  that  the  surface  has  interesting  structure  in  only  two  of  the 
co-ordinates,  namely  LDL  3-4  and  LDL  7-8.  We  can  infer  from  the  the  surface  that  the 
bow  shaped  relationship  between  these  two  variables  does  not  change  for  varying  levels  of 
HDL  3.  It  exhibits  an  independent  behaviour.  We  have  included  a  co-ordinate  plot  (figure 
6.6e)  of  the  estimated  co-ordinate  function  for  the  variables  LDL  7-8  which  helps  confirm 
this  claim-  The  relationship  between  LDL  7-8  and  (Aj,Aj)  depends  mainly  on  the  level  of 
Similar  information  is  conveyed  by  the  other  coordinate  plots,  or  can  be  seen  from  the 
estimated  surface  directly.  This  suggests  a  model  of  the  form 

/ LDL  3-4 \  //i(Ax)\  /ei\ 

LDL  7-8  =  /j(Ai)  +  e2  , 

V  HDL  3  /  V/3(A3);  [ej 

As  specified  Aj  is  confounded  with  HDL  3,  and  is  thus  unidentifiable.  We  need  to  estimate 
the  first  two  components  of  the  model  This  is  a  principal  curve  model,  and  figure  6.6f 
shows  the  estimated  curve.  It  exhibits  the  same  dependence  between  LDL  7-8  and  LDL  3-4 
as  did  the  surface.  The  curve  explains  92.6%  of  the  variance  in  the  two  variables,  whereas 
the  principal  component  line  explains  only  80%. 

Williams  and  Krauss  performed  a  similar  analysis  look;,  J  at  pairs  of  variables  at  a 
time.  We  discuss  their  techniques  in  chapter  7.  Their  results  are  qualitatively  the  same  as 
ours  for  the  LDL  pair. 


[/W* 


Figure  6.6a  The  principal  surface  for  the 
serum  concentrations  LDL  7-8,  LDL  3-4  and 
HDL  3  in  a  sample  of  81  sedentary  men.  Vari¬ 
able  HDL  3  is  into  the  page. 


Figure  6.6b  The  principal  surface  as  in 
figure  8.6a  from  a  different  viewpoint.  Variable 
LDL  7-8  is  into  the  page. 


Figure  6.6c  The  principal  surface  as  in 
figure  6.6a  from  a  different  viewpoint.  Variable 
LDL  3*4  is  into  the  page. 


Figure  6.6d  The  principal  surface  as  in 
figure  6.6a  from  a  slightly  oblique  perspective. 
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Figure  8.6e  The  estimated  co-ordinate 
function  for  LDL  7-8  versus  X.  X?  has  little 
effect. 


Figure  6.6f  The  principal  curve  for  the 
serum  concentrations  LDL  7-8  and  LDL  3-4  in 
a  sample  of  81  seaentry  men. 


Chapter  7 


Discussion  and  conclusions 


In  this  chapter  we  discuss  some  of  the  existing  techniques  for  symmetric  smoothing,  as  well 
as  the  various  generalizations  of  principal  components  and  factor  analysis.  We  compare 
these  techniques  with  the  methodolcgy  developed  here.  The  chapter  concludes  with  a 
summary  of  the  uses  of  principal  curves  and  surfaces. 

7.1.  Alternative  techniques. 

Other  non-linear  generalizations  of  principal  components  exist  in  the  literature.  They  can 
be  broadly  classified  according  to  two  dichotomies. 

•  We  can  estimate  either  the  non-linear  manifold  or  the  non-lincar  constraint  that  defines 
the  manifold.  In  linear  principal  components  the  approaches  are  equivalent. 

•  The  non-linearity  can  be  achieved  by  transforming  the  space  or  by  transforming  the 
model. 

The  principal  curve  and  surface  procedures  model  the  non-linear  manifold  by  transforming 
the  model. 

7.1.1.  Generalized  linear  principal  components. 

This  approach  corresponds  to  modeling  either  the  nonlinear  constraint  or  the  manifold  by 
transforming  the  apace.  The  idea  here  is  to  introduce  some  extra  variables,  where  each  new 
variable  is  some  non-linear  transformation  of  the  existing  co-ordinates.  One  then  seeks  a 
subspace  of  this  non  linear  co-ordinate  system  that  models  the  data  well.  The  subspace 
is  found  by  using  the  usual  linear  eigenvector  solution  in  ths  new  enlarged  space.  This 
technique  was  first  suggested  by  Guanadesikan  &&  Wilk  (1966, 1968),  and  a  good  description 
can  be  found  in  Gnanadesikan  (1977).  They  suggested  using  polynomial  functions  of  the 
original  p  co-ordinates.  The  resulting  linear  combinations  are  then  of  the  form  (  for  p  w  2 
and  quadratic  polynomials) 

Xj  s*  ayxj  +  aj,ij  +  a3yzix2  +  atjx\  +  ag;Xj  (7.1) 


Chapter  7:  Discussion  and  conclusions  89 

and  the  Oj  will  be  eigenvectors  of  the  appropriate  covariance  matrix. 

This  model  has  appeal  mainly  as  a  dimension  reducing  tool.  Typically  the  Unear. 
combination  with  the  smallest  variance  is  set  to  zero.  This  results  in  an  implicit  non-linear 
constraint  equation  as  in  (7.1)  where  we  set  A  =  0.  We  then  have  a  rank  one  reduction 
that  tells  us  that  the  data  Ues  dose  to  a  quadratic  manifold  in  the  original  co-ordinates. 

The  model  has  been  generalised  further  to  indude  more  general  transformations  of 
the  co-ordinates  other  than  quadratic,  but  the  idea  is  essentially  the  same  as  the  above;  a 
linear  solution  is  found  in  a  transformed  space.  Young,  Takane  he.  de  Leeuw  (1978)  and  later 
Friedman  (1983)  suggested  different  forms  of  this  generalization  to  indude  non-parametric 
transformations  of  the  co-ordinates.  The  problem  can  be  formulated  as  follows:  Find  a  and 
**(*)  =  (*l(*l).  •*•»**(»>))  such  that 

X  ![«(*)  -  «•'*(*)!*  =  min!  (7.2) 

or  alternatively  such  that 

Var  [e's(x)}  =  max!  (7.3) 

where  Ksj-(x,-)  —  0,  e'e  =  1  and  Xs?(zj)  =  1.  The  idea  is  to  transform  the  coordinates 
suitably  and  then  find  the  linear  prindpa!  components,  if  in  (7.3)  we  replaced  max  by  mt* 
then  we  would  be  setimating  the  constraint  in  the  transformed  space. 

The  estimation  procedure  alternates  between  $>v‘_g  the  s;(-)  and  finding  the  linear 
principal  components  in  the  transformed  space 

e  For  a  fixed  vector  of  functions  s(  rv  'hone  a  to  be  the  first  principal  component  of 
the  covariance  matrix  K*(s)r(<:j'. 

e  For  «  known,  (7.2)  can  be  written  in  the  form 

9 

k  X(et(xi)  -  £ *>)!*-!•  •'•«<  *  in  •*(•),*••, «*(•),  (7.4) 

>■* 

and  hji  are  functions  of  e  above.  lfsj,..v,s  a»«.  known,  equation  (7.4)  is  minimized 

by  , 

•»(*«)  * 

f- * 

This  is  true  for  any  *j,  and  suggest*  an  inner  iterative  loop.  This  inner  loop  is  very 
wmilar  to  the  ACE  algorithm  (Breiman  and  Friedman,  1982),  except  the  normalization 
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is  slightly  different.  Breimaa  and  Friedman  proved  that  the  ACE  algorithm  converges 
under  certain  regularity  conditions  in  the  distributional  case. 

The  disadvantages  of  this  technique  are: 

•  The  space  is  transformed,  and  in  order  to  understand  the  resultant  fit,  we  usually 
would  need  to  transform  back  to  the  original  space.  This  can  only  be  achieved  if  the 
transformations  are  restricted  to  monotone  functions.  In  the  transformed  space  the 
estimated  manifold  is  gives  by 

:  =  aa'*{x). 

Thus  if  the  *,-(•)  are  monotone,  we  get  untransformed  estimates  of  the  form 

m.rr’ 

UJ  WM 

where  s  *=  •'»(*).  Equation  (7.5)  defines  a  parametrised  curve.  The  curve  is  not 
completely  general  since  the  co-ordinate  functions  an  monotone.  Far  the  same  reason, 
Gnanadeeikan  (1978)  expressed  the  desirability  of  having  procedures  for  estimating 
models  of  the  type  proposed  in  this  dissertation. 

e  We  are  estimating  manifolds  that  are  does  to  the  data  in  the  transformed  co-ordinates. 
When  the  transformations  are  non-linear  this  can  result  in  distortion  of  the  error 
variances  for  individual  variables.  What  we  really  require  is  a  method  far  estimating 
manifolds  that  are  dose  to  the  data  in  the  original  p  co-ordinates  Of  course,  if  the , 
functions  are  linear,  both  approaches  are  identical. 

An  advantage  of  the  technique  is  that  it  can  easily  be  generalised  to  take  care  at  higher 
dimensional  manifolds,  although  not  in  an  entirely  general  fashion.  This  is  achieved  by 
replacing  n  with  A  where  A  ia  p  x  q  .  We  then  get  a  f  dimensional  hyperplane  in  thei 
transformed  space  given  by  AA'efa).  However,  we  end  up  with  a  number  at  implicit 
constraint  equations  which  are  hard  to  deal  with  and  interpret.  Despite  the  problems 
associated  with  generalised  priori  pal  components,  it  remains  a  useful  tool  for  performing 
rank  1  dimensionality  reductions. 
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7.1.2.  Multi-dimensional  scaling. 

This  is  a  technique  for  finding  a  low  dimensional  representation  of  high  dimensional  data. 
The  original  proposal  was  for  data  that  consists  of  (J)  dissimilarities  or  distances  between  n 
objects.  The  idea  is  to  find  a  m  (m  small,  1,  2  or  3)  dimensional  euclidean  representation  for 
the  objects  such  that  the  inter-object  distances  are  preserved  as  well  as  possible.  The  idea 
was  introduced  by  Torgerson  (1953),  and  followed  up  by  Shepard  (1962),  Kruskal  (1964a 
,1964b),  Shepard  it  Kruskal  (1964)  and  Shepard  it  Carroll  (1966).  Gnanadcsikan  (1978) 
gives  a  concise  description. 

The  procedures  have  also  been  suggested  for  situations  where  we  simply  want  a  lower 
Jim«niiinn«l  representation  of  high  dimensional  euclidean  data.  The  lower  dimensional 
representation  attempts  to  reproduce  the  interpoint  distances  in  the  original  space.  We 
fit  a  principal  curve  to  the  color  data  in  example  6.5;  these  data  were  originally  analyzed 
by  Shepard  and  Carroll  (1966)  using  MDS  techniques.  Although  there  have  been  some 
intriguing  example.'  at  the  technique  in  the  literature,  a  number  of  problems  exist. 

e  The  eohitioa  — of  a  vector  of  m  co-ordinate*  representing  the  location  of  points 
on  the  low  dimensional  manifold,  but  only  for  the  n  data  points.  What  we  don’t  get, 
and  often  desire  is  a  mapping  of  the  whole  space.  We  are  unable,  for  example,  to  find 
the  location  of  new  paints  in  the  reduced  space. 

•  The  procedures  are  computationally  expensive  and  unfeasible  for  large  n  (nm  >  300 
is  considered  large).  They  are  usually  expressed  as  non-linear  optimization  problems 
in  nm  parameters,  and  differ  in  the  choice  of  criterion. 

The  principal  curve  and  surface  procedures  partially  overcome  both  the  problems  listed 
above;  they  are  unable  to  find  structures  as  general  as  those  that  can  be  found  by  the  MDS 
procedures  due  to  the  averaging  nature  of  the  scatterplot  smoothers,  but  they  do  provide 
a  mapping  for  the  space.  We  have  demonstrated  their  ability  to  model  MDS  type  data  in 
examples  6.4  and  6.5.  They  do  not,  however,  provide  a  model  for  dissimilarities  which  was 
the  original  intention  of  multidimensional  scaling. 

7.1.3.  Proximity  models. 

Shepard  it  Carroll  (1966)  suggested  a  functional  model  similar  in  form  to  the  model  we 
suggest.  They  required  only  to  estimate  the  n  vectors  of  m  parameters  for  each  point,  and 
ronsidered  the  data  to  be  functions  thereof  The  parameters  (nm  altogether)  are  found 
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by  direct  search  as  in  MDS,  with  a  different  criterion  to  be  minimized.  Their  procedure, 
however,  was  geared  towards  data  without  error,  as  in  the  ball  data  in  example  6.4.  This 
becomes  evident  when  one  examines  the  criterion  they  used,  which  measures  the  continuity 
of  the  data  as  a  function  of  the  parameters.  When  the  data  is  not  smooth,  as  is  usually  the 
case,  we  need  to  estimate  functions  that  vary  smoothly  with  the  parameters,  and  are  close 
to  the  data. 

7.1.4.  Non-linear  factor  analysis. 

More  recently,  Etesadi-Amoli  and  McDonald  (1983)  approached  the  problem  of  non-linear 
factor  analysis  using  polynomial  functions.  They  use  a  model  of  the  form 

X  =  f(X)+t 

where  /  is  a  polynomial  in  the  unknown  parameters  or  factors.  Their  procedure  for  esti¬ 
mating  the  unknown  factors  and  coefficients  is  similar  to  ours  in  this  restricted  setting.  * 
Their  emphasis  is  on  the  factor  analysis  model,  and  once  the  appropriate  polynomial  terms 
have  been  found,  the  problem  is  treated  as  an  enlarged  factor  analysis  problem.  They  do 
not  estimate  the  A’s  as  we  do,  using  the  geometry  of  the  problem,  but  instead  perform  a 
search  in  nf  parameter  space,  where  q  is  the  dimension  of  A  and  n  is  the  number  of  obser¬ 
vations.  Our  emphasis  is  on  providing  one  and  two  dimensional  summaries  of  the  data.  In 
certain  situations,  these  summaries  can  be  used  as  estimates  of  the  appropriate  non-linear 
functional  and  factor  models. 

7.1.5.  Axis  interchangeable  smoothing. 

Cleveland  (1983)  describes  a  technique  for  symmetrically  smoothing  a  scatterplot  which  he 
calls  saw  interchangeable  emootkinq  (  which  we  will  refer  to  as  AI  smoothing).  We  briefly 
oathne  the  idea: 

•  standardise  each  coordinate  by  some  (robust)  measure  of  scale. 

e  rotate  the  coordinate  axes  by  45*.  (if  the  correlation  is  positive,  else  rotate  through 

-45*). 

•  smooth  ths  transformed  y  against  the  transformed  z. 

•  Their  paper  was  published  ia  the  September,  1983  issue  of  Psycbometrika,  whereas  Hastie 
(1963)  appeared  ia  Jaljr. 
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•  rotate  the  axes  back. 


«  unstandardize. 

If  the  standardisation  uses  regular  standard  deviations,  then  the  rotation  is  simply  a  change 
of  basis  to  the  principal  component  basis.  The  resulting  curve  minimises  the  distance  from 
the  points  orthogonal  to  this  principal  component.  It  has  intuitive  appeal  since  the  principal 
component  is  the  line  that  is  closest  in  distance  to  the  points.  We  then  allow  the  points 
to  tug’  in  the  principal  component  line.  It  is  simple  and  fast  to  compute  the  AI  Smooth, 
and  for  many  seatterplots  it  produces  curves  that  are  very  similar  to  the  principal  curve 
solution.  This  is  not  surprising  when  we  consider  the  following  theorem: 

Theorem  7.1 

If  the  two  variables  in  a  seatterplot  are  standardized  to  have  unit  standard  deviations, 
and  if  the  smoother  used  is  linear  and  reproduces  straight  lines  exactly,  then  the  axis 
interchangeable  smooth  is  identical  to  the  curve  of  the  first  iteration  of  the  principal  curve 
procedure. 


Proof 

Let  the  variables  x  and  y  be  standardized  as  above.  The  AI  Smooth  transforms  to  two 
new  variables 

«•  =  (*  +  ») 

.  ^  .  (7.6) 

’  ~  y/i 

Then  the  AI  Smooth  replaces  (s*,  y*)  by  (x*,  Smooth  (y*  [**)).  But  Smooth (x*  |x*)  = 
s*  since  the  smoother  reproduces  straight  lines  exactly.*  Thus  the  AI  Smooth  transforms 
back  to 

^  _(  Smooth  (x*  |  x*)  +  Smooth  (y*  |  **) 

A  _  (  Smooth  (**  |**)  —  Smooth  (y*  |x*)) 
f=  *  :  . 

Since  the  smoother  is  linear,  and  in  view  of  (7.6)  ,  (7.7)  becomes 

ftss  Smooth  (x  |  x*) 

(7.8) 

ft  =  Smooth  (y  j  x*) 


*  Any  weighted  local  fiaear  smoother  has  this  property.  Local  averages,  however,  do  not  unless 
the  predictors  are  evenly  spaced. 
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This  is  exactly  the  curve  found  after  the  first  iteration  of  the  principal  curve  procedure, 
since  A(°)  =  *\  I 

Williams  and  Krauss  (1982)  extended  the  AI  smooth  by  iterating  the  procedure.  At 
the  second  step,  the  residuals  are  calculated  locally  by  finding  the  tangent  to  the  curve  at 
each  point  and  evaluating  the  residuals  from  these  tangents.  The  new  fit  at  that  point  is 
the  smooth  of  these  residuals  against  their  projection  onto  the  tangent.  This  procedure 
would  probably  get  closer  to  the  principal  curve  solution  than  the  AI  smooth  (we  have 
not  implemented  the  Williams  and  Krauss  smooth).  Analytically  one  can  see  that  the 
procedures  differ  from  the  second  step  on. 

This  particular  approach  to  symmetric  smoothing  (in  terms  of  residuals  )  suffers  from 
several  deficiencies  : 

•  the  type  of  curves  that  can  be  found  are  not  as  general  as  those  found  by  the  principal 
curve  procedure. 

•  they  are  designed  for  seatterplots  and  do  not  generalize  to  curves  in  higher  dimensions. 

•  they  lack  the  interpretation  of  principal  curves  as  a  form  of  conditional,  expectation. 

T.2.  Conclusions.  . 

In  conclusion  we  summarize  the  role  of  principal  curves  and  surfaces  in  statistics  and  data 

analysts. 

•  They  generalise  the  one  and  two  dimensional  summaries  of  multivariate  data  usually 
provided  by  the  principal  components. 

•  When  the  principal  curves  and  surface  are  linear,  they  are  the  principal  component 
summaries. 

•  Locally  they  are  the  critical  points  of  the  usual  distance  function  for  such  summaries; 
this  gives  an  indication  that  there  are  not  too  many  of  them. 

•  They  are  defined  in  terms  of  conditional  expectations  which  satisfies  our  mental  image 
of  a  summary. 

e  They  provide  the  least  squares  estimate  for  generalized  versions  of  factor  analysis, 
functional  models  and  the  errors  in  variables  regression  models.  The  non-linear  errors 
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in  variables  model  has  been  used  successfully  a  number  of  times  in  practical  data 
analysis  problems  (notably  calibration  problems). 

•  In  some  situations  they  are  a  useful  alternative  to  MDS  techniques,  in  that  they  provide 
a  lower  dimensional  summary  of  the  space  as  opposed  to  the  data  set. 

0  In  some  situations  they  can  be  effective  in  identifying  outliers  in  higher  dimensional 

space. 

•  They  are  a  useful  data  exploratory  tool.  Motion  graphics  techniques  have  become 
popular  for  looking  at  3  dimensional  point  clouds.  Experience  show  that  it  is  often 
impossible  to  identify  certain  structures  in  the  data  by  simply  rotating  the  points.  A 
summary  such  as  that  given  by  the  principal  curve  and  surfaces  can  identify  structures 
that  woujd  otherwise  be  transparent,  even  if  the  data  could  be  viewed  in  a  real  three 
dimensional  model. 
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